# DS5K
## Project 3: Using ML to Predict Illegal U.S. Southwest Cross-Border Activity

'''
My intent is to map land-based border activity from previous years (FY 2017 onward) and use ML to predict where future cross-border activity is likely to be most prevalent and at what times of the year. My overall goal is to map trends in illegal border activity to support U.S. Customs and Border Protection response planning.

Data sets for each year:

U.S. Border Patrol Southwest Border Apprehensions by Sector | U.S. Customs and Border Protection (cbp.go
https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters/usbp-sw-border-apprehensions
- **FY 2021 (October 1, 2020 - March 31, 2021)**
 
U.S. Border Patrol Southwest Border Apprehensions by Sector Fiscal Year 2020 | U.S. Customs and Border Protection (cbp.gov)
https://www.cbp.gov/newsroom/stats/sw-border-migration/usbp-sw-border-apprehensions-fy2020
- **FY 2020 (October 1, 2019 - September 30, 2020)**
 
U.S. Border Patrol Southwest Border Apprehensions by Sector Fiscal Year 2019 | U.S. Customs and Border Protection (cbp.gov)
https://www.cbp.gov/newsroom/stats/sw-border-migration/usbp-sw-border-apprehensions-fy2019
- **FY 2019 (October 1, 2018 - September 30, 2019)**
 
U.S. Border Patrol Southwest Border Apprehensions by Sector FY2018 | U.S. Customs and Border Protection (cbp.gov)
https://www.cbp.gov/newsroom/stats/usbp-sw-border-apprehensions
- **FY 2018 (October 1, 2017 - September 30, 2018)**
 
U.S. Border Patrol Southwest Border Apprehensions by Sector FY2017 | U.S. Customs and Border Protection (cbp.gov)
https://www.cbp.gov/newsroom/stats/usbp-sw-border-apprehensions-fy2017
- **FY 2017 (October 1, 2016 - September 30, 2017)**
 
As you begin EDA and review the data, look for elements you can 'predict' (e.g., months when crossings occur, days when crossings occur, etc.). Look for correlations as well.
'''
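
Since the goal includes predicting "what times of the year", a date-to-fiscal-year helper is likely to be useful during EDA. The sketch below assumes the October 1 to September 30 fiscal year shown in the data set list above; the function names `fiscal_year` and `fiscal_month` are my own and do not come from CBP or any library.

```python
from datetime import date


def fiscal_year(d: date) -> int:
    """Return the fiscal year containing d, assuming the FY starts on October 1."""
    return d.year + 1 if d.month >= 10 else d.year


def fiscal_month(d: date) -> int:
    """Return the 1-based month within the fiscal year (October = 1, September = 12)."""
    return (d.month - 10) % 12 + 1


print(fiscal_year(date(2020, 10, 1)))   # first day of FY 2021
print(fiscal_year(date(2020, 9, 30)))   # last day of FY 2020
print(fiscal_month(date(2021, 3, 31)))  # March is the 6th month of the fiscal year
```

A month index like this could later serve as a seasonal feature, although whether it carries predictive signal is something the EDA would have to show.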

"This predictor seems to be highly associated with this outcome..."

USBP and OFO official year end reporting for FY20; USBP and OFO end of month reporting for FY21TD. Data is current as of 7/6/21.

# sklearn

#### Import
#### Instantiate
#### Fit
#### Predict
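
The four steps listed above can be sketched end to end as follows. This is only an illustration of the pattern: the data is synthetic (it is not CBP data), and comparing against `DummyRegressor` as a baseline is my assumption about how the model might be judged, not something the project has settled on.

```python
import numpy as np
from sklearn.dummy import DummyRegressor                 # Import
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (NOT CBP figures): y = 3x + 5 plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression()                               # Instantiate
baseline = DummyRegressor(strategy="mean")               # always predicts the training mean

model.fit(X_train, y_train)                              # Fit
baseline.fit(X_train, y_train)

y_pred = model.predict(X_test)                           # Predict

# score() returns R^2 for regressors; the baseline shows what "no skill" looks like
model_r2 = model.score(X_test, y_test)
baseline_r2 = baseline.score(X_test, y_test)
print(f"LinearRegression R^2: {model_r2:.3f}")
print(f"DummyRegressor   R^2: {baseline_r2:.3f}")
```

A model that cannot beat the mean-only baseline on held-out data is not yet telling us anything about the outcome.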

In [68]:
# Import libraries to interact with html, tables, and plotting data

import pandas as pd
import numpy as np # used for linear algebra and random sampling
import seaborn as sns
from matplotlib import pyplot as plt
import bs4
from bs4 import BeautifulSoup
from html.parser import HTMLParser
import glob
import requests
import urllib.request
from urllib.request import Request, urlopen
import lxml
import html5lib
import webbrowser
import joblib

# sklearn libraries for ML
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
# from sklearn.datasets import load_boston  # deprecated in scikit-learn 1.0 and removed in 1.2
from sklearn.dummy import DummyRegressor
from sklearn import model_selection
from sklearn.model_selection import cross_val_score

# Render plots inside the notebook (instead of in a separate window)
%matplotlib inline

# pickle saves a fitted model in its current state so the ML model does not need to be retrained
import pickle

# !pip install requests
# !pip install requests-html
# !pip install lxml
# !pip install html5lib
# !pip install tensorflow
# import tensorflow as tf

print(f'Pandas v{pd.__version__}')
print(f'Numpy v{np.__version__}')
print(f'joblib v{joblib.__version__}')
print(f'sklearn v{sklearn.__version__}')
# print(f'TensorFlow v{tf.__version__}')

Pandas v1.3.4
Numpy v1.21.4
joblib v1.1.0
sklearn v1.0.1
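
The `pickle` import above is there so a fitted model can be reloaded without retraining. Below is a minimal sketch of that round trip, assuming a tiny synthetic fit and a hypothetical file name (`border_model.pkl`); neither comes from the project itself.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic fit (placeholder data, not CBP figures): y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# A temporary directory keeps the example self-cleaning; the file name is hypothetical
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "border_model.pkl"

    # Save the fitted model to disk
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back; no call to fit() is needed
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

prediction = restored.predict(np.array([[4.0]]))
print(prediction)
```

`joblib.dump` and `joblib.load` (also imported above) follow the same save/load shape. Either way, only unpickle files from a trusted source, and expect to reload with the same scikit-learn version that saved the model.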


In [86]:
data_FY2021_url = "https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters/usbp-sw-border-apprehensions"
data_FY2020_url = "https://www.cbp.gov/newsroom/stats/sw-border-migration/usbp-sw-border-apprehensions-fy2020"
data_FY2019_url = "https://www.cbp.gov/newsroom/stats/sw-border-migration/usbp-sw-border-apprehensions-fy2019"
data_FY2018_url = "https://www.cbp.gov/newsroom/stats/usbp-sw-border-apprehensions"
data_FY2017_url = "https://www.cbp.gov/newsroom/stats/usbp-sw-border-apprehensions-fy2017"

In [87]:
websites = [data_FY2021_url, data_FY2020_url, data_FY2019_url, data_FY2018_url, data_FY2017_url]
websites

['https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters/usbp-sw-border-apprehensions',
 'https://www.cbp.gov/newsroom/stats/sw-border-migration/usbp-sw-border-apprehensions-fy2020',
 'https://www.cbp.gov/newsroom/stats/sw-border-migration/usbp-sw-border-apprehensions-fy2019',
 'https://www.cbp.gov/newsroom/stats/usbp-sw-border-apprehensions',
 'https://www.cbp.gov/newsroom/stats/usbp-sw-border-apprehensions-fy2017']

In [89]:
from io import StringIO

def website_tables(websites):
    """Fetch each URL once and return one list of DataFrames per page."""
    tables_list = []
    headers = {'User-Agent': 'Mozilla/5.0'}
    for website in websites:
        resp = requests.get(website, headers=headers)

        if resp.status_code == 200:
            print(f'Status code: {resp.status_code} - {website} was successfully processed\n')
            # pd.read_html returns a list of DataFrames, one per <table> on the page
            tables_list.append(pd.read_html(StringIO(resp.text)))
        else:
            print(f'ERROR: Status code {resp.status_code} on website {website}\n')
    return tables_list

tables_list = website_tables(websites)
print(tables_list)

Status code: 200 - https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters/usbp-sw-border-apprehensions was successfully processed

Status code: 200 - https://www.cbp.gov/newsroom/stats/sw-border-migration/usbp-sw-border-apprehensions-fy2020 was successfully processed

Status code: 200 - https://www.cbp.gov/newsroom/stats/sw-border-migration/usbp-sw-border-apprehensions-fy2019 was successfully processed

Status code: 200 - https://www.cbp.gov/newsroom/stats/usbp-sw-border-apprehensions was successfully processed

Status code: 200 - https://www.cbp.gov/newsroom/stats/usbp-sw-border-apprehensions-fy2017 was successfully processed

[[  Unaccompanied Children Encounters by Sector                          \
                                       Sector FY20 TD MAR FY21 TD MAR   
0                                    Big Bend         254         845   
1                                     Del Rio        1166        3431   
2                                   El Centro         715
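
The printed output above shows that `tables_list` is a list of lists (one inner list of DataFrames per URL) and that each table has a two-level column header. A possible way to flatten this into a single tidy frame is sketched below. The stand-in table reuses the first two rows printed above; `make_table`, `fiscal_years`, and the `source_fy` and `category` column names are assumptions made for illustration.

```python
import pandas as pd


def make_table(title, rows):
    # Build a small frame with the same two-level header shape as the scraped tables
    columns = pd.MultiIndex.from_product(
        [[title], ["Sector", "FY20 TD MAR", "FY21 TD MAR"]]
    )
    return pd.DataFrame(rows, columns=columns)


# Stand-in for the scraped tables_list: one inner list per URL
tables_list = [
    [
        make_table(
            "Unaccompanied Children Encounters by Sector",
            [["Big Bend", 254, 845], ["Del Rio", 1166, 3431]],
        )
    ],
]
fiscal_years = ["FY2021"]  # assumed labels, one per URL, in the same order as the URL list

frames = []
for fy, page_tables in zip(fiscal_years, tables_list):
    for table in page_tables:
        tidy = table.copy()
        category = tidy.columns.get_level_values(0)[0]  # table title from the top header row
        tidy.columns = tidy.columns.droplevel(0)        # keep only the Sector / FY columns
        tidy.insert(0, "category", category)
        tidy.insert(0, "source_fy", fy)
        frames.append(tidy)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

On the real pages the tables will not all share the same columns, so `pd.concat` would produce a union of columns with NaN gaps; renaming columns to a common scheme first is probably necessary.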

In [50]:
# table = soup.find_all('table')
# df = pd.read_html(str(table))

In [78]:
resp = requests.get("https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters/usbp-sw-border-apprehensions")

soup = BeautifulSoup(resp.text,'html.parser')
print(resp)
print(soup)
resp.status_code  # only the last expression in a cell is displayed


<Response [200]>
<!DOCTYPE html>

<html dir="ltr" lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="https://www.cbp.gov/profiles/cbp_gov/themes/cbp_gov_theme/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<!--[if gt IE 8]><html class="" />
<![endif]--><!--[if IE 8]><html class="ie8" />
<![endif]--><script type="text/javascript">dataLayer = [{"entityType":"node","entityBundle":"accordion_page","entityId":"365977","entityLabel":"U.S. Border Patrol Southwest Border Apprehensions by Sector","entityLanguage":"en","entityTnid":"0","entityVid":"1086246","entityCreated":"1605038683","entityTaxonomy":{"tags":{"68":"Statistics","162":"Unaccompanied Children (UC)","42":"U.S. Border Patrol"}},"drupalLanguage":"en","userUid":0}];</script>
<meta content="width=device-width, initial-scale=1, maximum-scale=5.5, minimum-scale=1, user-scalable=yes" name="viewport"/>
<meta content="NOTE: This webpage is no longer updated." name="descrip

200

In [54]:
# pd.read_html(data_FY2020_url)

In [55]:
# Open with GET method
resp = requests.get(data_FY2021_url)

In [56]:
print(resp)

<Response [200]>


In [58]:
def websites_data(URL):
    # Fetch a single page to scrape and return the response, or None on failure.
    # Open with GET method (the request must happen before the status check)
    resp = requests.get(URL)

    # HTTP status 200 means OK
    if resp.status_code == 200:
        print("Successfully opened webpage")
        return resp

    print(f"Error, your response code was {resp.status_code}")
    return None

In [59]:
websites

['https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters/usbp-sw-border-apprehensions',
 'https://www.cbp.gov/newsroom/stats/sw-border-migration/usbp-sw-border-apprehensions-fy2020',
 'https://www.cbp.gov/newsroom/stats/sw-border-migration/usbp-sw-border-apprehensions-fy2019',
 'https://www.cbp.gov/newsroom/stats/usbp-sw-border-apprehensions',
 'https://www.cbp.gov/newsroom/stats/usbp-sw-border-apprehensions-fy2017']

In [60]:
def fetch_fy2021_page():
    # Named differently from the `websites` URL list so the list is not overwritten
    data_FY2021_url = "https://www.cbp.gov/newsroom/stats/southwest-land-border-encounters/usbp-sw-border-apprehensions"

    # Open with GET method
    resp = requests.get(data_FY2021_url)

    # HTTP status 200 means OK
    if resp.status_code == 200:
        print("Successfully opened webpage")
        print("Here is your webpage:-\n")

        # We need a parser; Python's built-in HTML parser will work
        soup = BeautifulSoup(resp.text, 'html.parser')

        # soup.find() returns None when nothing matches; no <ul class="searchNews">
        # was found on this page, so None is printed below
        text_list = soup.find("ul", {"class": "searchNews"})
        print(text_list)
    else:
        print(f"Error, your response code was {resp.status_code}")
fetch_fy2021_page()

Successfully opened webpage
Here is your webpage:-

None


In [61]:
soup = BeautifulSoup(resp.text,'html.parser')
soup.head.title

<title>U.S. Border Patrol Southwest Border Apprehensions by Sector | U.S. Customs and Border Protection</title>

In [62]:
soup.body.a.text

'Skip to main content'

In [63]:
soup.body.p.b     # evaluates to None here (no output): the first <p> in <body> has no <b> child

In [64]:
soup.body.div

<div id="skip-link">
<a class="element-invisible element-focusable" href="#main-content">Skip to main content</a>
</div>

In [65]:
soup.body.thead

<thead><tr><th colspan="4" scope="col" style="text-align: center;">Unaccompanied Children Encounters by Sector</th></tr><tr><th scope="col" style="text-align: left;">Sector</th><th scope="col" style="text-align: center;">FY20 TD MAR</th><th scope="col" style="text-align: center;">FY21 TD MAR</th><th scope="col" style="text-align: center;">% Change<br/>FY20 TD MAR to FY21 TD MAR</th></tr></thead>

In [66]:
soup.body.tbody

<tbody><tr><td>Big Bend</td><td style="text-align: center;">254</td><td>           845</td><td style="text-align: center;">233%</td></tr><tr><td>Del Rio</td><td style="text-align: center;">1,166</td><td>        3,431</td><td style="text-align: center;">194%</td></tr><tr><td>El Centro</td><td style="text-align: center;">715</td><td>        1,047</td><td style="text-align: center;">46%</td></tr><tr><td>El Paso</td><td style="text-align: center;">2,598</td><td>        8,636</td><td style="text-align: center;">232%</td></tr><tr><td>Laredo</td><td style="text-align: center;">1,607</td><td>        2,045</td><td style="text-align: center;">27%</td></tr><tr><td>Rio Grande</td><td style="text-align: center;"> 6,351</td><td>      20,964</td><td style="text-align: center;">230%</td></tr><tr><td>San Diego</td><td style="text-align: center;">1,040</td><td>        1,888</td><td style="text-align: center;">82%</td></tr><tr><td>Tucson</td><td style="text-align: center;">3,859</td><td>        7,079</td
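
The `<tbody>` markup above shows two things any parser has to handle: padded whitespace inside cells and thousands separators such as `1,166`. As a rough sketch, the rows could be pulled out with the standard-library `HTMLParser` that is already imported. `TableRowParser` and `to_int` are names made up for this example, and the HTML string is just the first two rows copied from the output above.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


def to_int(text):
    """Convert a string such as '1,166' to the integer 1166."""
    return int(text.replace(",", "").strip())


# First two rows copied from the soup.body.tbody output above
html = (
    '<tbody><tr><td>Big Bend</td><td style="text-align: center;">254</td>'
    '<td>           845</td><td style="text-align: center;">233%</td></tr>'
    '<tr><td>Del Rio</td><td style="text-align: center;">1,166</td>'
    '<td>        3,431</td><td style="text-align: center;">194%</td></tr></tbody>'
)

parser = TableRowParser()
parser.feed(html)
parser.close()

records = [(sector, to_int(fy20), to_int(fy21)) for sector, fy20, fy21, _pct in parser.rows]
print(records)  # [('Big Bend', 254, 845), ('Del Rio', 1166, 3431)]
```

In practice `pd.read_html` does this work already (it strips thousands separators by default), so a hand-written parser like this is mainly useful for understanding the page structure or for tables that `read_html` cannot handle.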

In [67]:
# URLs from the U.S. Customs and Border Protection website with data tables 

# # Import html file using BeautifulSoup
# with open(data_FY2021_url) as f:
#     # read file
#     content = f.read()
#     # parse html
#     soup = soup(content, 'html.parser')
#     # print Title tag
#     print(soup.title)

# HTMLParser.feed(' ',data_FY2021_url)

# # urllib.request.urlopen(url).read()

# webbrowser.open(data_FY2021_url)

# res = requests.get(data_FY2021_url)

# type(res)
# res.status_code == requests.codes.ok
# len(res.text)
# print(res.text[:250])

# res.raise_for_status()
# noStarchSoup = soup(res.text)
# type(noStarchSoup)


# soup_file = open(data_FY2021_url)
# soup_pull = soup(data_FY2021_url.read())