<a href="https://colab.research.google.com/github/lwallac2/Bank-Marketing/blob/main/Module13_Webscraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Module 13: Webscraping**

Up to this point, we have mainly learned how the most popular data analytics algorithms work, how to preprocess our data to get them to work, and how to configure these algorithms to get them to work faster and better. For that, we always had "canned" data available, meaning datasets that were already in some sort of .csv format, nicely tabulated, and ready for analysis.

But real life is harsh. Data doesn't usually show up in neat little (or big) .csv packaging. It is messy, crazy, and unstructured. Think, for example, about product ratings on Amazon.com, comments on YouTube or Instagram, or threads of tweets on Twitter, video responses to other videos on TikTok, likes/ dislikes and donations on Discord or Twitch, and so on. There's a whole lot of data there, and it can be incredibly useful. But 1. how do you get to it and 2. how do you analyze it? That's what this module is all about.

At the end of this module, you will be able to:
* Acquire data from a webpage
* Clean data obtained from a webpage
* Acquire data from an API

Let's go.

# **0. Preparation and Setup**
Well, we need our libraries again, this time for webscraping and for textual analysis, which we will do in the second half of this file.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# **1. Web Scraping**
Web scraping is the process of using code to extract content and data from a website. It's kind of like building your own web search and using it to scour the internet for appropriate data--for example, sites like LinkedIn or Indeed for the title of the job you want after you graduate, or sites like [catster.com](https://www.catster.com/), Youtube.com or Reddit's [CatAdvice subreddit](https://www.reddit.com/r/CatAdvice/) for, say, information about healthy cat food.

To perform web scraping, we will import the libraries shown below. The [urllib.request](https://docs.python.org/3/library/urllib.request.html) module is used to open URLs. The [Beautiful Soup package](https://pypi.org/project/beautifulsoup4/) is used to extract data from html files. The Beautiful Soup library's name is bs4 which stands for Beautiful Soup, version 4.

This is an amended copy of the [Datacamp Tutorial on Web Scraping](https://www.datacamp.com/community/tutorials/web-scraping-using-python).

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

Now we specify the URL containing the dataset and pass it to urlopen() to get the html of the page.

**NOTE** that some pages will produce the following error: `HTTPError: HTTP Error 403: Forbidden`; the developers built in anti-scraping security code.

In [None]:
url = "https://www.hubertiming.com/results/2017GPTR10K"
# url = "http://help.websiteos.com/websiteos/example_of_a_simple_html_page.htm"
# url = "https://www.zyte.com/learn/what-is-web-scraping/"
html = urlopen(url)

Getting the html of the page is just the first step. Next step is to create a Beautiful Soup object from the html. This is done by passing the html to the BeautifulSoup() function. The Beautiful Soup package is used to parse the html, that is, take the raw html text and break it into Python objects. The second argument 'lxml' is the html parser which assigns the Python objects to the appropriate tag delimiters.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
type(soup)

bs4.BeautifulSoup

Now we use the soup object to extract interesting information about the website we are scraping such as getting the title of the page as shown below.

In [None]:
# Get the title
title = soup.title
print(title)

<title>Race results for the 2017 Intel Great Place to Run \ Urban Clash Games!</title>


You can also get the text of the webpage and quickly print it out to check if it is what you expect.

In [None]:
# Print out the text
text = soup.get_text()
print(soup.text)








Race results for the 2017 Intel Great Place to Run \ Urban Clash Games!





















 2017 Intel Great Place to Run 10K \ Urban Clash Games
 Hillsboro Stadium, Hillsboro, OR 
 June 2nd, 2017


                            





 Email
                        timing@hubertiming.com with results questions. Please include your bib number if you have it.


                    






Huber Timing Home





10K:


Finishers:
577


Male:
414


Female:
163









 5K Individual
 5K Team
 10K Individual
 10K Team
 Summary




Indvidual Results



10K Results



Search:

Search
Division:

Men
Women
Non Binary
Masters Men
Masters Women
Masters Non Binary

F 18-25
F 26-35
F 36-45
F 46-55
F Under 18
M 18-25
M 26-35
M 36-45
M 46-55
M 55+
M Under 18
 Team:

Unattached
COLUMBIA TEAM A
COLUMBIA TEAM B
COLUMBIA TEAM C
COLUMBIA TEAM D
COLUMBIA TEAM E
DTNA1
DTNA2
DTNA3
FXG1
INTEL TEAM A
INTEL TEAM B
INTEL TEAM C
INTEL TEAM D
INTEL TEAM E
INTEL TEAM F
INTEL TEAM G
INTEL TEAM H
INTEL 

Now, open a new tab on your web browser and go directly to the website you are scraping. Right-click into the website and, from the popup menu, select "Inspect." If you are in Chrome, this will open a developer view with many tabs on the right side of your screen. This will show you the code of the webpage (although you may have to open and close a number of expanders to see any actual HTML tags).

<div>
<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/webscraping_HTML_example.png" width="500">
</div>

You can use the find_all() method of soup to extract useful html tags within a webpage. Examples of useful tags include < a > for hyperlinks, < table > for tables, < tr > for table rows, < th > for table headers, and < td > for table cells. The code below shows how to extract all the hyperlinks within the webpage.


In [None]:
soup.find_all('a')

[<a href="mailto:timing@hubertiming.com">timing@hubertiming.com</a>,
 <a href="https://www.hubertiming.com">Huber Timing Home</a>,
 <a class="btn btn-primary btn-lg" href="/results/2017GPTR" role="button" style="margin: 0px 0px 5px 5px"><i aria-hidden="true" class="fa fa-user"></i> 5K Individual</a>,
 <a class="btn btn-primary btn-lg" href="/results/team/2017GPTR" role="button" style="margin: 0px 0px 5px 5px"><i aria-hidden="true" class="fa fa-users"></i> 5K Team</a>,
 <a class="btn btn-primary btn-lg" href="/results/team/2017GPTR10K" role="button" style="margin: 0px 0px 5px 5px"><i aria-hidden="true" class="fa fa-users"></i> 10K Team</a>,
 <a class="btn btn-primary btn-lg" href="/results/summary/2017GPTR10K" role="button" style="margin: 0px 0px 5px 5px"><i class="fa fa-stream"></i> Summary</a>,
 <a id="individual" name="individual"></a>,
 <a data-url="/results/2017GPTR10K" href="#tabs-1" id="rootTab" style="font-size: 18px">10K Results</a>,
 <a href="https://www.hubertiming.com/"><img

As you can see from the output above, html tags sometimes come with attributes such as class, src, etc. These attributes provide additional information about html elements. You can use a for loop and the get('"href") method to extract and print out only hyperlinks.

In [None]:
all_links = soup.find_all("a")
for link in all_links:
    print(link.get("href"))

mailto:timing@hubertiming.com
https://www.hubertiming.com
/results/2017GPTR
/results/team/2017GPTR
/results/team/2017GPTR10K
/results/summary/2017GPTR10K
None
#tabs-1
https://www.hubertiming.com/
https://facebook.com/hubertiming/
None


To print out table rows only, pass the 'tr' argument in soup.find_all().

In [None]:
# Print the first 7 rows for sanity check
rows = soup.find_all('tr')
print(rows [:7])

[<tr colspan="2">
<b>10K:</b>
</tr>, <tr>
<td>Finishers:</td>
<td>577</td>
</tr>, <tr>
<td>Male:</td>
<td>414</td>
</tr>, <tr>
<td>Female:</td>
<td>163</td>
</tr>, <tr class="header">
<th>Place</th>
<th>Bib</th>
<th>Name</th>
<th>Gender</th>
<th>City</th>
<th>Chip Time</th>
<th>Gun Time</th>
<th>Team</th>
</tr>, <tr data-bib="814">
<td>1</td>
<td>814</td>
<td>

                    JARED WILSON

                </td>
<td>M</td>
<td>TIGARD</td>
<td>36:21</td>
<td>36:24</td>
<td></td>
</tr>, <tr data-bib="573">
<td>2</td>
<td>573</td>
<td>

                    NATHAN A SUSTERSIC

                </td>
<td>M</td>
<td>PORTLAND</td>
<td>36:42</td>
<td>36:45</td>
<td>
<img class="lazy teamThumbs" data-src="/teamLogoThumbnail/logo?teamName=INTEL%20TEAM%20F&amp;raceId=1251&amp;state=OR"/>
                            INTEL TEAM F
                        </td>
</tr>]


## **1.1. Preprocessing**
Our goal here is to convert the data from the webpage into a dataframe so we can do our data magic with it. To get there, we need to get all table rows in list form first and then convert that list into a dataframe. Below is a for loop that iterates through table rows and prints out the cells of the rows.

### **1.1.1 Extracting data from table rows**

In [None]:
for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)

[<td>577</td>, <td>443</td>, <td>

                    LIBBY B MITCHELL

                </td>, <td>F</td>, <td>HILLSBORO</td>, <td>1:41:18</td>, <td>1:42:10</td>, <td></td>]


bs4.element.ResultSet

### **1.1.2. Cleaning Data: Removing HTML Tags**

The output above shows that each row is printed with html tags embedded in each row. This is not what you want. You can use remove the html tags using Beautiful Soup or regular expressions.

The easiest way to remove html tags is to use Beautiful Soup, and it takes just one line of code to do this. Pass the string of interest into BeautifulSoup() and use the get_text() method to extract the text without html tags.

In [None]:
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)

[577, 443, 

                    LIBBY B MITCHELL

                , F, HILLSBORO, 1:41:18, 1:42:10, ]


The code below shows how to build a regular expression that finds all the characters inside the < td > html tags and replace them with an empty string for each table row. First, you compile a regular expression by passing a string to match to re.compile(). The dot, star, and question mark (.*?) will match an opening angle bracket followed by anything and followed by a closing angle bracket. It matches text in a non-greedy fashion, that is, it matches the shortest possible string. If you omit the question mark, it will match all the text between the first opening angle bracket and the last closing angle bracket. After compiling a regular expression, you can use the re.sub() method to find all the substrings where the regular expression matches and replace them with an empty string. The full code below generates an empty list, extract text in between html tags for each row, and append it to the assigned list.

In [None]:
import re

list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')  # matches an opening angle bracket followed by anything and followed by a closing angle bracket
    clean2 = (re.sub(clean, '',str_cells))
    list_rows.append(clean2)
print(clean2)
type(clean2)

[577, 443, 

                    LIBBY B MITCHELL

                , F, HILLSBORO, 1:41:18, 1:42:10, ]


str

## **1.2 Converting Data to Dataframe**
The next step is to convert the list into a dataframe and get a quick view of the first 10 rows using Pandas.

In [None]:
df = pd.DataFrame(list_rows)
df.head(10)

Unnamed: 0,0
0,[]
1,"[Finishers:, 577]"
2,"[Male:, 414]"
3,"[Female:, 163]"
4,[]
5,"[1, 814, \r\n\r\n JARED WIL..."
6,"[2, 573, \r\n\r\n NATHAN A ..."
7,"[3, 687, \r\n\r\n FRANCISCO..."
8,"[4, 623, \r\n\r\n PAUL MORR..."
9,"[5, 569, \r\n\r\n DEREK G O..."


### **1.2.1 Cleaning Data: Formatting the Dataframe**
The dataframe is not in the format we want. To clean it up, you should split the "0" column into multiple columns at the comma position. This is accomplished by using the str.split() method.

In [None]:
df1 = df[0].str.split(',', expand=True)
df1.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7
0,[],,,,,,,
1,[Finishers:,577],,,,,,
2,[Male:,414],,,,,,
3,[Female:,163],,,,,,
4,[],,,,,,,
5,[1,814,\r\n\r\n JARED WILSON\r\n\...,M,TIGARD,36:21,36:24,]
6,[2,573,\r\n\r\n NATHAN A SUSTERSI...,M,PORTLAND,36:42,36:45,\n\r\n INTEL TEAM ...
7,[3,687,\r\n\r\n FRANCISCO MAYA\r\...,M,PORTLAND,37:44,37:48,]
8,[4,623,\r\n\r\n PAUL MORROW\r\n\r...,M,BEAVERTON,38:34,38:37,]
9,[5,569,\r\n\r\n DEREK G OSBORNE\r...,M,HILLSBORO,39:21,39:24,\n\r\n INTEL TEAM ...


This looks much better, but there is still work to do. The dataframe has unwanted square brackets surrounding each row. You can use the strip() method to remove the opening square bracket on column "0."

In [None]:
df1[0] = df1[0].str.strip('[')
df1.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7
0,],,,,,,,
1,Finishers:,577],,,,,,
2,Male:,414],,,,,,
3,Female:,163],,,,,,
4,],,,,,,,
5,1,814,\r\n\r\n JARED WILSON\r\n\...,M,TIGARD,36:21,36:24,]
6,2,573,\r\n\r\n NATHAN A SUSTERSI...,M,PORTLAND,36:42,36:45,\n\r\n INTEL TEAM ...
7,3,687,\r\n\r\n FRANCISCO MAYA\r\...,M,PORTLAND,37:44,37:48,]
8,4,623,\r\n\r\n PAUL MORROW\r\n\r...,M,BEAVERTON,38:34,38:37,]
9,5,569,\r\n\r\n DEREK G OSBORNE\r...,M,HILLSBORO,39:21,39:24,\n\r\n INTEL TEAM ...


### **1.2.2 Building Table Headers**

The table is missing table headers. You can use the find_all() method to get the table headers.

In [None]:
col_labels = soup.find_all('th')

Just like what we did with the table rows, you can use Beautiful Soup to extract text in between html tags for table headers.

In [None]:
all_header = []
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str, "lxml").get_text()
all_header.append(cleantext2)
print(all_header)

['[Place, Bib, Name, Gender, City, Chip Time, Gun Time, Team]']


You can then convert the list of headers into a pandas dataframe.

In [None]:
df2 = pd.DataFrame(all_header)
df2.head()

Unnamed: 0,0
0,"[Place, Bib, Name, Gender, City, Chip Time, Gu..."


Similarly, you can split column "0" into multiple columns at the comma position for all rows.

In [None]:
df3 = df2[0].str.split(',', expand=True)
df3.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,[Place,Bib,Name,Gender,City,Chip Time,Gun Time,Team]


Now we can concatenate the two dataframes into one using the concat() method as illustrated below.

In [None]:
frames = [df3, df1]

df4 = pd.concat(frames)
df4.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7
0,[Place,Bib,Name,Gender,City,Chip Time,Gun Time,Team]
0,],,,,,,,
1,Finishers:,577],,,,,,
2,Male:,414],,,,,,
3,Female:,163],,,,,,
4,],,,,,,,
5,1,814,\r\n\r\n JARED WILSON\r\n\...,M,TIGARD,36:21,36:24,]
6,2,573,\r\n\r\n NATHAN A SUSTERSI...,M,PORTLAND,36:42,36:45,\n\r\n INTEL TEAM ...
7,3,687,\r\n\r\n FRANCISCO MAYA\r\...,M,PORTLAND,37:44,37:48,]
8,4,623,\r\n\r\n PAUL MORROW\r\n\r...,M,BEAVERTON,38:34,38:37,]


Below shows how to assign the first row to be the table header.

In [None]:
df5 = df4.rename(columns=df4.iloc[0])
df5.head()

Unnamed: 0,[Place,Bib,Name,Gender,City,Chip Time,Gun Time,Team]
0,[Place,Bib,Name,Gender,City,Chip Time,Gun Time,Team]
0,],,,,,,,
1,Finishers:,577],,,,,,
2,Male:,414],,,,,,
3,Female:,163],,,,,,


At this point, the table is almost properly formatted. For analysis, you can start by getting an overview of the data as shown below.

In [None]:
df5.info()
df5.shape

<class 'pandas.core.frame.DataFrame'>
Int64Index: 583 entries, 0 to 581
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   [Place      583 non-null    object
 1    Bib        581 non-null    object
 2    Name       578 non-null    object
 3    Gender     578 non-null    object
 4    City       578 non-null    object
 5    Chip Time  578 non-null    object
 6    Gun Time   578 non-null    object
 7    Team]      578 non-null    object
dtypes: object(8)
memory usage: 41.0+ KB


(583, 8)

The table has 583 rows and 10 columns. You can drop all rows with any missing values.

In [None]:
df6 = df5.dropna(axis=0, how='any')

Also, notice how the table header is replicated as the first row in df5. It can be dropped using the following line of code.

In [None]:
df7 = df6.drop(df6.index[0])
df7.head()

Unnamed: 0,[Place,Bib,Name,Gender,City,Chip Time,Gun Time,Team]
5,1,814,\r\n\r\n JARED WILSON\r\n\...,M,TIGARD,36:21,36:24,]
6,2,573,\r\n\r\n NATHAN A SUSTERSI...,M,PORTLAND,36:42,36:45,\n\r\n INTEL TEAM ...
7,3,687,\r\n\r\n FRANCISCO MAYA\r\...,M,PORTLAND,37:44,37:48,]
8,4,623,\r\n\r\n PAUL MORROW\r\n\r...,M,BEAVERTON,38:34,38:37,]
9,5,569,\r\n\r\n DEREK G OSBORNE\r...,M,HILLSBORO,39:21,39:24,\n\r\n INTEL TEAM ...


You can perform more data cleaning by renaming the '[Place' and ' Team]' columns. Python is very picky about space. Make sure you include space after the quotation mark in ' Team]'.

In [None]:
df7.rename(columns={'[Place': 'Place'},inplace=True)
df7.rename(columns={' Team]': 'Team'},inplace=True)
df7.head()

Unnamed: 0,Place,Bib,Name,Gender,City,Chip Time,Gun Time,Team
5,1,814,\r\n\r\n JARED WILSON\r\n\...,M,TIGARD,36:21,36:24,]
6,2,573,\r\n\r\n NATHAN A SUSTERSI...,M,PORTLAND,36:42,36:45,\n\r\n INTEL TEAM ...
7,3,687,\r\n\r\n FRANCISCO MAYA\r\...,M,PORTLAND,37:44,37:48,]
8,4,623,\r\n\r\n PAUL MORROW\r\n\r...,M,BEAVERTON,38:34,38:37,]
9,5,569,\r\n\r\n DEREK G OSBORNE\r...,M,HILLSBORO,39:21,39:24,\n\r\n INTEL TEAM ...


## **1.3 Data Cleaning: Removing White Space and Special Characters**

The final data cleaning steps involve removing the closing bracket for cells in the "Team" column, the white space, and the new line characters

In [None]:
df7['Team'] = df7['Team'].str.strip(']')
df7.head()

Unnamed: 0,Place,Bib,Name,Gender,City,Chip Time,Gun Time,Team
5,1,814,\r\n\r\n JARED WILSON\r\n\...,M,TIGARD,36:21,36:24,
6,2,573,\r\n\r\n NATHAN A SUSTERSI...,M,PORTLAND,36:42,36:45,\n\r\n INTEL TEAM ...
7,3,687,\r\n\r\n FRANCISCO MAYA\r\...,M,PORTLAND,37:44,37:48,
8,4,623,\r\n\r\n PAUL MORROW\r\n\r...,M,BEAVERTON,38:34,38:37,
9,5,569,\r\n\r\n DEREK G OSBORNE\r...,M,HILLSBORO,39:21,39:24,\n\r\n INTEL TEAM ...


Removing any white space

In [None]:
df7.replace(r'\s', '', regex = True, inplace = True)

And getting rid of the new line characters

In [None]:
df8 = df7.replace(r'\\n',' ', regex=True) 
df8.head()

Unnamed: 0,Place,Bib,Name,Gender,City,Chip Time,Gun Time,Team
5,1,814,JAREDWILSON,M,TIGARD,36:21,36:24,
6,2,573,NATHANASUSTERSIC,M,PORTLAND,36:42,36:45,INTELTEAMF
7,3,687,FRANCISCOMAYA,M,PORTLAND,37:44,37:48,
8,4,623,PAULMORROW,M,BEAVERTON,38:34,38:37,
9,5,569,DEREKGOSBORNE,M,HILLSBORO,39:21,39:24,INTELTEAMF


It took a while to get here, but at this point, the dataframe is in the desired format. 

If you would like to read about another webscraping project, take a look at [this blog post about scraping a job portal](https://realpython.com/beautiful-soup-web-scraper-python/). This is about getting data from the [Fake Python Jobs](https://realpython.github.io/fake-jobs/) site (**NOTE**: These are **FAKE** posts; the jobs **don't exist**; this is a site built exclusively for static HTML-based web scraping). How is that for hunting for your dream job?

# **2. Working with an API**
Getting information directly from webpages is one thing--in fact, a lot of online  marketing companies specialize in scraping and cleaning data and then sell these data to other businesses for further analysis. As you may already guess, this works only with static HTML sites, that is, with sites that send your browser complete webpages, not just shells of css and javascript or Ajax with a database call in the middle. And, as you have seen, this can be very painful.

That's why some website providers offer application programming interfaces (APIs) that allow you to access their data in a predefined manner. With APIs, you can avoid parsing HTML. Instead, you can access the data directly using formats like JSON and XML. 

When you use an API, the process is generally more stable than gathering the data through web scraping. That’s because developers create APIs to be consumed by programs rather than by human eyes.

## **2.1 Setting up the Data Source**
In order to work with an API, the first step is always to obtain the required login credentials into the source of your data and store these in an **APP** (yes, I said app because that's what this is called--no relation to whatever you have running in your cellphone). Imagine this like getting the key to your house or apartment from your landlord or realtor:


### **2.1.1 API #1: New York Times**


Let's assume we work with the **New York Times** API:
1. Go to [the New York Times API website](https://developer.nytimes.com/get-started) and sign up for an account. 
2. Once you have completed the email verification process, log into the website and start setting your API key up under Get Started (see below). This is the access tool that your Python code needs in order to download data from the API.
<div>
<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/nytapi.JPG" width="500">
</div>
4. The instructions ask you to create an app. No worries: That is API-speak for a set of dedicated access keys that allow you to use the API. After you log in, go to Get Started and follow the app generation procedure. Enable the Article Search, the Community API, and the Top Stories API and don't forget to hit "save."
5. This will give you an app ID and a set of keys. 
<div>
<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/nytapi_keys.JPG" width="500">
</div>
6. Once you know that you have these available, go to the APIs and learn about how to connect to each of them. Be sure that the API you wish to use is **AUTHORIZED** in your app.

Now you have set up ONE side of the puzzle--**the API side.**

To learn more about how to connect into the New York Times API, check out this [blog post](https://dlab.berkeley.edu/blog/scraping-new-york-times-articles-python-tutorial) or [this (somewhat older) notebook](https://github.com/nilmolne/Text-Mining-The-New-York-Times-Articles/blob/master/Code/HowToUse.ipynb) or this [notebook about COVID-related articles](https://github.com/brienna/coronavirus-news-analysis/blob/master/2020_05_01_get_data_from_NYT.ipynb).


### **2.1.2 API #2: Reddit**

The principle here is the same as before: Build a user account and configure your app, then write down your key(s) because you'll need it/ them when your code wants to connect to the data source. Here is how this works on Reddit, complete with demonstration:






In [None]:
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/FdjVoOf9HN4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

### **2.1.3 API #3: Twitter**

Social Media APIs are the hardest to get access to. Right now, in order to qualify for an app on Twitter, you have to submit detailed explanations about how you will access the data, how you will store them, and how you will analyze them. In contrast to the New York Times API, where authorization is automated and instant, Twitter, just like Facebook, Instagram, and others, has actual human employees review and approve your application. That's because ever since the [Cambridge Analytica scandal](https://en.wikipedia.org/wiki/Cambridge_Analytica) and with the strict enforcement of [GDPR](https://gdpr.eu/), social networks can get into **a lot** of very expensive trouble if they do not protect user data well. 

Read [here](https://developer.twitter.com/en/products/twitter-api/standard) about Twitter's policies for using their API and check out the requirements for [applying for a "developer account,"](https://developer.twitter.com/en/portal/petition/use-case) which is the platform on which you would build your app. [These instructions](https://developer.twitter.com/en/docs/tutorials/step-by-step-guide-to-making-your-first-request-to-the-twitter-api-v2) will walk you through the process step-by-step

**HERE ARE SOME OF THE QUESTIONS YOU WILL HAVE TO ANSWER** if you apply for API access:

* How will you use the Twitter API or Twitter Data?
  * In English, please describe how you plan to use Twitter data and/or APIs. The more detailed the response, the easier it is to review and approve. Please be thoughtful and thorough

* Please answer each of the following with as much detail and accuracy as possible. Failure to do so could result in delays to your access to Twitter developer platform or rejected applications.
  * Are you planning to analyze Twitter data? Please describe how you will analyze Twitter data including any analysis of Tweets or Twitter users. Please be thoughtful and thorough
* Will your app use Tweet, Retweet, Like, Follow, or Direct Message functionality?
  * Please describe your planned use of these features. Please be thoughtful and thorough
* Do you plan to display Tweets or aggregate data about Twitter content outside Twitter?
  * Please describe how and where Tweets and/or data about Twitter content will be displayed outside of Twitter. Please be thoughtful and thorough
* Will your product, service, or analysis make Twitter content or derived information available to a government entity?
  * Please list all government entities you intend to provide Twitter content or derived information to under this use case.

Once you have written the Great American Novel for each of these fields, submitted, and verified your email address, you'll get here:

<div>
<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/twitter.JPG" width="500">
</div>

**FINALLY**, you'll be routed to a set of tutorials which provide you with the code stubs for [sentiment analysis](https://developer.twitter.com/en/docs/tutorials/how-to-analyze-the-sentiment-of-your-own-tweets) or time series analysis or any other analysis you could wish for. 

That's a lot of work, isn't it?

## **2.2 Setting up Python**
Now that you have set up the API of your choice, the second part of the project is to connect to it and retrieve the data. This means that, now that you have the key(s) available, you write your Colab or Jupyter Notebook code to use these keys to log into the API. This comes in different flavors, as well.



### **2.2.1 New York Times**
So, to connect to the **New York Times**, the simplest code can look like what you are seeing below, with the requests, os, an pprint libraries. Note that  NYTIMES_APIKEY is the key that you received when signing up for the New York Times APIs (see screenshot above). Here you see the Top Story API in action:*italicized text*

In [None]:
import requests
import os
from pprint import pprint

apikey = os.getenv('NYTIMES_APIKEY', '...')

# Top Stories:
# https://developer.nytimes.com/docs/top-stories-product/1/overview
section = "science"
query_url = f"https://api.nytimes.com/svc/topstories/v2/{section}.json?api-key={apikey}"

r = requests.get(query_url)
pprint(r.json())

{'fault': {'detail': {'errorcode': 'oauth.v2.InvalidApiKey'},
           'faultstring': 'Invalid ApiKey'}}


The snippet above is very straightforward. We run a GET request against topstories/v2 endpoint supplying section name and our API key. 

The output comes in JSON format and looks like this:
```
{ 'last_updated': '2020-08-09T08:07:44-04:00',
 'num_results': 25,
 'results': [{'abstract': 'New Zealand marked 100 days with no new reported '
                          'cases of local coronavirus transmission. France '
                          'will require people to wear masks in crowded '
                          'outdoor areas.',
              'byline': '',
              'created_date': '2020-08-09T08:00:12-04:00',
              'item_type': 'Article',
              'multimedia': [{'caption': '',
                              'copyright': 'The New York Times',
                              'format': 'superJumbo',
                              'height': 1080,
                              'subtype': 'photo',
                              'type': 'image',
                              'url': 'https://static01.nyt.com/images/2020/08/03/us/us-briefing-promo-image-print/us-briefing-promo-image-superJumbo.jpg',
                              'width': 1920},
                             ],
              'published_date': '2020-08-09T08:00:12-04:00',
              'section': 'world',
              'short_url': 'https://nyti.ms/3gH9NXP',
              'title': 'Coronavirus Live Updates: DeWine Stresses Tests’ '
                       'Value, Even After His False Positive',
              'uri': 'nyt://article/27dd9f30-ad63-52fe-95ab-1eba3d6a553b',
              'url': 'https://www.nytimes.com/2020/08/09/world/coronavirus-covid-19.html'},
             ]
 }

```
That is the shortest API call. The article API gives you more filtering options. The only mandatory field is q (query), which is the search term. Beyond that you can mix and match filter query, date range ( begin_date, end_date), page number, sort order and facet fields. 
```
# Article Search:
# https://api.nytimes.com/svc/search/v2/articlesearch.json?q=<QUERY>&api-key=<APIKEY>
# Use - https://developer.nytimes.com/docs/articlesearch-product/1/routes/articlesearch.json/get to explore API

query = "politics"
begin_date = "20200701"  # YYYYMMDD
filter_query = "\"body:(\"Trump\") AND glocations:(\"WASHINGTON\")\""  # http://www.lucenetutorial.com/lucene-query-syntax.html
page = "0"  # <0-100>
sort = "relevance"  # newest, oldest
query_url = f"https://api.nytimes.com/svc/search/v2/articlesearch.json?" \
            f"q={query}" \
            f"&api-key={apikey}" \
            f"&begin_date={begin_date}" \
            f"&fq={filter_query}" \
            f"&page={page}" \
            f"&sort={sort}"

r = requests.get(query_url)
pprint(r.json())
```
The final challenge is to transform the JSON content into a file that can serve as a data frame. 

To learn more about newspaper APIs, check out [Martin Heinz' blog post](https://martinheinz.dev/blog/31) about how to connect to them.


### **2.2.2 Reddit**
Connecting with **Reddit** is explained really well in [this article](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c), which the code in this section summarizes. Also, you need more than just one token this time around:
1. Your Reddit Login and password
2. The personal use script key you receive when you sign up for your app
3. The secret key that you receive when you sign up for your app

Once you have these, you will need to set up your OAuth configuration. This will assign to you a token that expires every 2 hours:

```
import requests

# note that CLIENT_ID refers to 'personal use script' and SECRET_TOKEN to 'token'
auth = requests.auth.HTTPBasicAuth('<CLIENT_ID>', '<SECRET_TOKEN>')

# here we pass our login method (password), username, and password
data = {'grant_type': 'password',
        'username': '<USERNAME>',
        'password': '<PASSWORD>'}

# setup our header info, which gives reddit a brief description of our app
headers = {'User-Agent': 'MyBot/0.0.1'}

# send our request for an OAuth token
res = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=auth, data=data, headers=headers)

# convert response to JSON and pull access_token value
TOKEN = res.json()['access_token']

# add authorization to our headers dictionary
headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

# while the token is valid (~2 hours) we just add headers=headers to our requests
requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)
```
That is the first part. Now the API trusts us, and we can retrieve text. First, we'll look at the most popular posts:
```
res = requests.get("https://oauth.reddit.com/r/python/hot",
                   headers=headers)

print(res.json())  # This will pull out all the "hot" posts
```
The result will be a big ugly JSON-formatted set. Think about it as a bag of words.


---


Let's inspect this bag of words in order to find the most interesting posts about--what else?--cats. First, we'll explore the titles of the retrieved posts:
```
for post in res.json()['data'] ['cats']:
   print(post['data'] ['title']
```
There is always enough material on Reddit about cats! So, let's build this search as a filter into our data retrieval query **AND** collect the output in a dataframe (which we will name "fluffy"):

```
# make a request for the trending posts in /r/Python
res = requests.get("https://oauth.reddit.com/r/python/hot",
                   headers=headers)

fluffy = pd.DataFrame()  # initializing our dataframe

# loop through each post retrieved from GET request
for post in res.json()['data']['cats']:
    # here, we append relevant data to our dataframe
    df = df.append({
        'subreddit': post['data']['subreddit'],
        'title': post['data']['title'],
        'selftext': post['data']['selftext'],
        'upvote_ratio': post['data']['upvote_ratio'],
        'ups': post['data']['ups'],
        'downs': post['data']['downs'],
        'score': post['data']['score']
    }, ignore_index=True)

```
---
Afterwards, inspect the dataframe with fluffy.head(), and you'll see all the new posts about cats already in a dataframe. Follow the steps in part 1 of this workbook to clean your data up a little, including removing URLs and any special characters, and  your data is ready for analysis!


### **2.2.3 Twitter**
Connecting to any high-visibility social media is more involved these days due to data privacy concerns--even if the data is as public as on Twitter (trying to get the Facebook or Instagram APIs set up and calls working can take days, mostly for approval turnarounds).

BUT, assuming that you have your logins and your tokens available, the Twitter API is easily managed. Twitter has published [a treasure trove of code stubs](https://github.com/twitterdev/Twitter-API-v2-sample-code) on GitHub for that purpose. 

The most interesting code snippets for our purpose are:
* [Full archive search](https://github.com/twitterdev/Twitter-API-v2-sample-code/blob/main/Full-Archive-Search/full-archive-search.py)
* [User Tweet timeline](https://github.com/twitterdev/Twitter-API-v2-sample-code/blob/main/User-Tweet-Timeline/user_tweets.py)

Twitter also allows you to connect with the [Postman package](https://developer.twitter.com/en/docs/tools-and-libraries/using-postman), which uses HTTP to retrieve data through a GUI (of course, any true hacker wouldn't be caught dead using a GUI).

If you already have Twitter API access, I encourage you to explore the options here. If not, the NYT or the Reddit APIs might be easier for you to work with.


## Your Turn
In this workbook, you have encountered 2 major methods of obtaining data: Either through direct webscraping (section 1) or through the use of an API (section 2). One of the big takeaways here is that all APIs behave slightly differently, but the process has several steps in common:
1. Register for a user account on the website
2. Build your app
3. Note your keys, which you will need to connect
4. Pivot to your notebook
5. Install at a minimum the requests and os packages
6. Set up variables to hold your keys
7. Test the authentication method
8. Write your query code--test
9. Edit your query code to pull data into a dataframe--test
10. Clean and format the data
Pick one of these methods or one of the APIs with which you want to work. Then see what data you can pull down. Try formatting the data. If you would like more help converting JSON output to a pandas dataframe than you are seeing above, [this article](https://towardsdatascience.com/how-to-convert-json-into-a-pandas-dataframe-100b2ae1e0d8) will walk you through the individual steps.

In [85]:
Client_ID = 'LjekfhCDls3khKj3Lqejow'
Secret_Key = 'Z8AYpf0T559U6e5H-m1va4RRUfjVrQ' 

In [86]:
import requests
auth = requests.auth.HTTPBasicAuth(Client_ID, Secret_Key)

In [87]:
data = {'grant_type': 'password',
        'username': 'lwallac2',
        'password': 'Tucker13'}

In [88]:
headers = {'User-Agent': 'MyAPI/0.0.1'}

In [89]:
res = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=auth, data=data, headers=headers)

In [94]:
res.json()

{'access_token': '1716145450863-TKuT41Fli8UN539vAjjror12B-XYpg',
 'expires_in': 86400,
 'scope': '*',
 'token_type': 'bearer'}

In [91]:
TOKEN = res.json()['access_token']

In [92]:
headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

In [93]:
headers

{'Authorization': 'bearer 1716145450863-TKuT41Fli8UN539vAjjror12B-XYpg',
 'User-Agent': 'MyAPI/0.0.1'}

In [95]:
requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)

<Response [200]>

In [96]:
res = requests.get("https://oauth.reddit.com/r/dogs/new",
                   headers=headers, params={'limit':'100'})

In [97]:
res.json()

{'data': {'after': 't3_u2iohk',
  'before': None,
  'children': [{'data': {'all_awardings': [],
     'allow_live_comments': False,
     'approved_at_utc': None,
     'approved_by': None,
     'archived': False,
     'author': 'ExternalConcert7430',
     'author_flair_background_color': None,
     'author_flair_css_class': None,
     'author_flair_richtext': [],
     'author_flair_template_id': None,
     'author_flair_text': None,
     'author_flair_text_color': None,
     'author_flair_type': 'text',
     'author_fullname': 't2_9gxfaloh',
     'author_is_blocked': False,
     'author_patreon_flair': False,
     'author_premium': False,
     'awarders': [],
     'banned_at_utc': None,
     'banned_by': None,
     'can_gild': True,
     'can_mod_post': False,
     'category': None,
     'clicked': False,
     'content_categories': None,
     'contest_mode': False,
     'created': 1649876544.0,
     'created_utc': 1649876544.0,
     'discussion_type': None,
     'distinguished': None,
  

In [98]:
for post in res.json()['data']['children']:
   print(post ['data'] ['title'])

Pancreatitis Recovery
First six weeks of chemical castration, what are your experiences?
Duck Jerky
Seeking advice! Rehomed a dog about 1.5 years ago and need advice if I should even think about getting another dog.
My 2 1/2 year-old dog got really aggressive towards me out of nowhere last night. Any thoughts of why?
Emotional support dog for my wife. How can I go about helping her out on this?
My dogs temperament has changed with age
Poo perils!
Tips to stop my dog from chewing on our shoes
Vet
My dog is suddenly refusing to walk?
How do you deal with owners who have no idea what to do with their dog in public?
Anxious non-stop barking dog (during the night)
Nail grinder... but WHICH?
American Hairless Terrier
DCM: any updates from vet schools or new studies? - best pet food
Need a bed my amstaff won't chew to pieces..
Dog died, leaving older brother alone.
clear, odourless &amp; colourless liquid coming out of dog
leaving puppy at home for 5 hours....
what breed of dog do you recomme

In [99]:
import pandas as pd
df = pd.DataFrame()

In [100]:
for post in res.json()['data']['children']:
   df = df.append({
       'subreddit': post['data']['subreddit'],
       'title': post['data']['title'],
       'selftext': post['data']['selftext'],
       'upvote_ratio': post['data']['upvote_ratio'],
       'ups': post['data']['ups'],
       'downs': post['data']['downs'],
       'score': post['data']['score']
   }, ignore_index=True)

In [101]:
df

Unnamed: 0,subreddit,title,selftext,upvote_ratio,ups,downs,score
0,dogs,Pancreatitis Recovery,My dog was hospitalized two separate times in ...,1.00,1.0,0.0,1.0
1,dogs,"First six weeks of chemical castration, what a...",I’m on week three of the six month chip with m...,1.00,2.0,0.0,2.0
2,dogs,Duck Jerky,So I have a bag of duck jerky bought at the lo...,1.00,1.0,0.0,1.0
3,dogs,Seeking advice! Rehomed a dog about 1.5 years ...,So just some history on me. I had a Lab for a...,1.00,2.0,0.0,2.0
4,dogs,My 2 1/2 year-old dog got really aggressive to...,He’s a rescue. Some type of shepherd and hound...,1.00,1.0,0.0,1.0
...,...,...,...,...,...,...,...
95,dogs,My dog won’t stop barking at night and I don’t...,"Hello Everyone. \nSo at this point, im upset a...",1.00,1.0,0.0,1.0
96,dogs,What are things your dog does that prove they ...,I have two dogs and they are one of the best t...,0.60,1.0,0.0,1.0
97,dogs,Signs my dogs are cold,"I adopted both of my dogs in Florida, then mov...",0.50,0.0,0.0,0.0
98,dogs,14 week old (lab) retriever/ mastiff mix help,We adopted a (m) retriever/mastiff mix that is...,0.50,0.0,0.0,0.0


# Website Web Scraping

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
from urllib.request import urlopen
url = "https://www.colts.com/team/players-roster/"
html = urlopen(url)

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
type(soup)

bs4.BeautifulSoup

In [None]:
# Get the title
title = soup.title
print(title)

<title>Colts Roster | Indianapolis Colts - colts.com</title>


In [None]:
# Print out the text
text = soup.get_text()
print(soup.text)

(function(){var host=window.location.hostname;var element=document.createElement('script');var firstScript=document.getElementsByTagName('script')[0];var url='https://quantcast.mgr.consensu.org'.concat('/choice/','gGpYeVwuEvd4w','/',host,'/choice.js')
var uspTries=0;var uspTriesLimit=3;element.async=true;element.type='text/javascript';element.src=url;firstScript.parentNode.insertBefore(element,firstScript);function makeStub(){var TCF_LOCATOR_NAME='__tcfapiLocator';var queue=[];var win=window;var cmpFrame;function addFrame(){var doc=win.document;var otherCMP=!!(win.frames[TCF_LOCATOR_NAME]);if(!otherCMP){if(doc.body){var iframe=doc.createElement('iframe');iframe.style.cssText='display:none';iframe.name=TCF_LOCATOR_NAME;doc.body.appendChild(iframe);}else{setTimeout(addFrame,5);}}
return!otherCMP;}
function tcfAPIHandler(){var gdprApplies;var args=arguments;if(!args.length){return queue;}else if(args[0]==='setGdprApplies'){if(args.length>3&&args[2]===2&&typeof args[3]==='boolean'){gdprApp

In [None]:
soup.find_all('a')

[<a class="d3-u-block-bypass" href="#main-content" tabindex="0"> <span>Skip to main content</span> </a>,
 <a class="d3-o-nav__logo" data-event_name="click action" data-link_module="Header" data-link_name="Nav Logo" data-link_type="Nav Logo" data-link_url="/" href="https://www.colts.com" title="Link to club's homepage"> <picture><!--[if IE 9]><video style="display:none"><![endif]--><source data-srcset="https://static.www.nfl.com/t_q-best/league/api/clubs/logos/IND" media="(min-width:1024px)"><source data-srcset="https://static.www.nfl.com/t_q-best/league/api/clubs/logos/IND" media="(min-width:768px)"><source data-srcset="https://static.www.nfl.com/t_q-best/league/api/clubs/logos/IND"><!--[if IE 9]></video><![endif]--></source></source></source></picture> </a>,
 <a aria

In [None]:
all_links = soup.find_all("a")
for link in all_links:
    print(link.get("href"))

#main-content
https://www.colts.com
https://www.colts.com/news/2022-nfl-free-agency-tracker-signings-trades-transactions
https://www.colts.com/news/index
https://www.colts.com/video/index
https://www.colts.com/audio/index
https://www.colts.com/photos/index
https://www.colts.com/team/players-roster
https://www.colts.com/schedule/index
#2ndlevel
https://www.colts.com/events/index
https://www.colts.com/fans/index
https://www.colts.com/clubs/index
https://www.colts.com/cheerleaders/index
https://www.colts.com/community/index
https://www.colts.com/game-day/index
https://www.colts.com/blue/index
http://forums.colts.com/
https://www.colts.com/fans/gridiron-hall/
https://www.colts.com/community/youthfootball
https://www.colts.com/community/kicking-the-stigma
https://www.colts.com/tickets/index
https://www.colts.com/fans/app
https://www.gopjn.com/t/RkFIS01FS0RBTEdER0ZBSUlMRUg
https://shop.colts.com/?_s=bm-colts-topnav-shop-042319
https://sports.yahoo.com/nfl/live-video/?is_retargeting=true&af_s

In [None]:
# Print the first 7 rows for sanity check
rows = soup.find_all('tr')
print(rows [:70])

[<tr><th>Player</th><th class="{sorter:'append'}">#</th><th class="{sorter:'append'}">Pos</th><th>HT</th><th class="{sorter:'append'}">WT</th><th class="{sorter:'append'}">Age</th><th class="{sorter:'append'}">Exp</th><th class="{sorter:'append'}">College</th></tr>, <tr><td class="sorter-lastname" scope="row" tabindex="0"><div class="d3-o-media-object"><figure class="d3-o-media-object__figure"><a href="/team/players-roster/mo-alie-cox/" title="Mo Alie-Cox"> <picture is-lazy="/t_lazy"><!--[if IE 9]><video style="display:none"><![endif]--><source media="(min-width:1024px)" srcset="https://static.clubs.nfl.com/image/private/t_thumb_squared/t_lazy/f_auto/colts/v2g8ry0udf5357tbzkyx.jpg 1x, https://static.clubs.nfl.com/image/private/t_thumb_squared_2x/t_lazy/f_auto/colts/v2g8ry0udf5357tbzkyx.jpg 2x, https://static.clubs.nfl.com/image/private/t_thumb_squared_3x/t_lazy/f_auto/colts/v2g8ry0udf5357tbzkyx.jpg"><source media="(min-width:768px)" srcset="https://static.clubs.nfl.com/image/private/t_

In [None]:
for row in rows:
    row_td = row.find_all('td')
print(row_td)
type(row_td)

[<td class="sorter-lastname" scope="row" tabindex="0"><div class="d3-o-media-object"><figure class="d3-o-media-object__figure"><a href="/team/players-roster/eli-wolf/" title="Eli Wolf"> <picture is-lazy="/t_lazy"><!--[if IE 9]><video style="display:none"><![endif]--><source media="(min-width:1024px)" srcset="https://static.clubs.nfl.com/image/private/t_thumb_squared/t_lazy/f_auto/colts/q6cveirg1nvupbdbwauc.jpg 1x, https://static.clubs.nfl.com/image/private/t_thumb_squared_2x/t_lazy/f_auto/colts/q6cveirg1nvupbdbwauc.jpg 2x, https://static.clubs.nfl.com/image/private/t_thumb_squared_3x/t_lazy/f_auto/colts/q6cveirg1nvupbdbwauc.jpg"><source media="(min-width:768px)" srcset="https://static.clubs.nfl.com/image/private/t_thumb_squared/t_lazy/f_auto/colts/q6cveirg1nvupbdbwauc.jpg 1x, https://static.clubs.nfl.com/image/private/t_thumb_squared_2x/t_lazy/f_auto/colts/q6cveirg1nvupbdbwauc.jpg 2x, https://static.clubs.nfl.com/image/private/t_thumb_squared_3x/t_lazy/f_auto/colts/q6cveirg1nvupbdbwauc

bs4.element.ResultSet

In [None]:
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)

[  Eli Wolf, 85, TE, 6-4, 238, 25, 1, Georgia]


In [None]:
import re

list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')  # matches an opening angle bracket followed by anything and followed by a closing angle bracket
    clean2 = (re.sub(clean, '',str_cells))
    list_rows.append(clean2)
print(clean2)
type(clean2)

[  Eli Wolf, 85, TE, 6-4, 238, 25, 1, Georgia]


str

In [None]:
df = pd.DataFrame(list_rows)
df.head(10)

Unnamed: 0,0
0,[]
1,"[ Mo Alie-Cox, 81, TE, 6-5, 267, 28, 5, Virgi..."
2,"[ Ben Banogu, 52, DE, 6-3, 252, 26, 4, Texas ..."
3,"[ Julian Blackmon, 32, S, 6-0, 187, 23, 3, Utah]"
4,"[ Rodrigo Blankenship, 3, K, 6-1, 184, 25, 3,..."
5,"[ Tony Brown, , CB, 6-0, 198, 26, 4, Alabama]"
6,"[ DeForest Buckner, 99, DT, 6-7, 295, 28, 7, ..."
7,"[ Parris Campbell, 1, WR, 6-0, 208, 24, 4, Oh..."
8,"[ Anthony Chesley, 47, CB, 6-0, 190, 26, 2, C..."
9,"[ Kameron Cline, 92, DE, 6-4, 283, 24, 1, Sou..."


In [None]:
df1 = df[0].str.split(',', expand=True)
df1.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7
0,[],,,,,,,
1,[ Mo Alie-Cox,81.0,TE,6-5,267.0,28.0,5.0,Virginia Commonwealth]
2,[ Ben Banogu,52.0,DE,6-3,252.0,26.0,4.0,Texas Christian]
3,[ Julian Blackmon,32.0,S,6-0,187.0,23.0,3.0,Utah]
4,[ Rodrigo Blankenship,3.0,K,6-1,184.0,25.0,3.0,Georgia]
5,[ Tony Brown,,CB,6-0,198.0,26.0,4.0,Alabama]
6,[ DeForest Buckner,99.0,DT,6-7,295.0,28.0,7.0,Oregon]
7,[ Parris Campbell,1.0,WR,6-0,208.0,24.0,4.0,Ohio State]
8,[ Anthony Chesley,47.0,CB,6-0,190.0,26.0,2.0,Coastal Carolina]
9,[ Kameron Cline,92.0,DE,6-4,283.0,24.0,1.0,South Dakota]


In [None]:
df1[0] = df1[0].str.strip('[')
df1.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7
0,],,,,,,,
1,Mo Alie-Cox,81.0,TE,6-5,267.0,28.0,5.0,Virginia Commonwealth]
2,Ben Banogu,52.0,DE,6-3,252.0,26.0,4.0,Texas Christian]
3,Julian Blackmon,32.0,S,6-0,187.0,23.0,3.0,Utah]
4,Rodrigo Blankenship,3.0,K,6-1,184.0,25.0,3.0,Georgia]
5,Tony Brown,,CB,6-0,198.0,26.0,4.0,Alabama]
6,DeForest Buckner,99.0,DT,6-7,295.0,28.0,7.0,Oregon]
7,Parris Campbell,1.0,WR,6-0,208.0,24.0,4.0,Ohio State]
8,Anthony Chesley,47.0,CB,6-0,190.0,26.0,2.0,Coastal Carolina]
9,Kameron Cline,92.0,DE,6-4,283.0,24.0,1.0,South Dakota]


In [None]:
col_labels = soup.find_all('th')

In [None]:
all_header = []
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str, "lxml").get_text()
all_header.append(cleantext2)
print(all_header)

['[Player, #, Pos, HT, WT, Age, Exp, College]']


In [None]:
df2 = pd.DataFrame(all_header)
df2.head()

Unnamed: 0,0
0,"[Player, #, Pos, HT, WT, Age, Exp, College]"


In [None]:
df3 = df2[0].str.split(',', expand=True)
df3.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,[Player,#,Pos,HT,WT,Age,Exp,College]


In [None]:
frames = [df3, df1]

df4 = pd.concat(frames)
df4.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7
0,[Player,#,Pos,HT,WT,Age,Exp,College]
0,],,,,,,,
1,Mo Alie-Cox,81,TE,6-5,267,28,5,Virginia Commonwealth]
2,Ben Banogu,52,DE,6-3,252,26,4,Texas Christian]
3,Julian Blackmon,32,S,6-0,187,23,3,Utah]
4,Rodrigo Blankenship,3,K,6-1,184,25,3,Georgia]
5,Tony Brown,,CB,6-0,198,26,4,Alabama]
6,DeForest Buckner,99,DT,6-7,295,28,7,Oregon]
7,Parris Campbell,1,WR,6-0,208,24,4,Ohio State]
8,Anthony Chesley,47,CB,6-0,190,26,2,Coastal Carolina]


In [None]:
df5 = df4.rename(columns=df4.iloc[0])
df5.head()

Unnamed: 0,[Player,#,Pos,HT,WT,Age,Exp,College]
0,[Player,#,Pos,HT,WT,Age,Exp,College]
0,],,,,,,,
1,Mo Alie-Cox,81,TE,6-5,267,28,5,Virginia Commonwealth]
2,Ben Banogu,52,DE,6-3,252,26,4,Texas Christian]
3,Julian Blackmon,32,S,6-0,187,23,3,Utah]


In [None]:
df5.info()
df5.shape

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62 entries, 0 to 60
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   [Player    62 non-null     object
 1    #         61 non-null     object
 2    Pos       61 non-null     object
 3    HT        61 non-null     object
 4    WT        61 non-null     object
 5    Age       61 non-null     object
 6    Exp       61 non-null     object
 7    College]  61 non-null     object
dtypes: object(8)
memory usage: 4.4+ KB


(62, 8)

In [None]:
df6 = df5.dropna(axis=0, how='any')

In [None]:
df7 = df6.drop(df6.index[0])
df7.head()

Unnamed: 0,[Player,#,Pos,HT,WT,Age,Exp,College]
1,Mo Alie-Cox,81.0,TE,6-5,267,28,5,Virginia Commonwealth]
2,Ben Banogu,52.0,DE,6-3,252,26,4,Texas Christian]
3,Julian Blackmon,32.0,S,6-0,187,23,3,Utah]
4,Rodrigo Blankenship,3.0,K,6-1,184,25,3,Georgia]
5,Tony Brown,,CB,6-0,198,26,4,Alabama]


In [None]:
df7.rename(columns={'[Player': 'Player'},inplace=True)
df7.rename(columns={' College]': 'College'},inplace=True)
df7.head()

Unnamed: 0,Player,#,Pos,HT,WT,Age,Exp,College
1,Mo Alie-Cox,81.0,TE,6-5,267,28,5,Virginia Commonwealth]
2,Ben Banogu,52.0,DE,6-3,252,26,4,Texas Christian]
3,Julian Blackmon,32.0,S,6-0,187,23,3,Utah]
4,Rodrigo Blankenship,3.0,K,6-1,184,25,3,Georgia]
5,Tony Brown,,CB,6-0,198,26,4,Alabama]


In [None]:
df7['College'] = df7['College'].str.strip(']')
df7.head()

Unnamed: 0,Player,#,Pos,HT,WT,Age,Exp,College
1,Mo Alie-Cox,81.0,TE,6-5,267,28,5,Virginia Commonwealth
2,Ben Banogu,52.0,DE,6-3,252,26,4,Texas Christian
3,Julian Blackmon,32.0,S,6-0,187,23,3,Utah
4,Rodrigo Blankenship,3.0,K,6-1,184,25,3,Georgia
5,Tony Brown,,CB,6-0,198,26,4,Alabama


In [None]:
df7.replace(r'\s', '', regex = True, inplace = True)

In [None]:
df8 = df7.replace(r'\\n',' ', regex=True) 
df8.head()

Unnamed: 0,Player,#,Pos,HT,WT,Age,Exp,College
1,MoAlie-Cox,81.0,TE,6-5,267,28,5,VirginiaCommonwealth
2,BenBanogu,52.0,DE,6-3,252,26,4,TexasChristian
3,JulianBlackmon,32.0,S,6-0,187,23,3,Utah
4,RodrigoBlankenship,3.0,K,6-1,184,25,3,Georgia
5,TonyBrown,,CB,6-0,198,26,4,Alabama


In [None]:
import requests
import os
from pprint import pprint

apikey = os.getenv('NYTIMES_APIKEY', '...')

# Top Stories:
# https://developer.nytimes.com/docs/top-stories-product/1/overview
section = "dogs"
query_url = f"https://api.nytimes.com/svc/search/v2/articlesearch.json?q=election&api-key=1AUdlg1E5MIhqlMcdGknbAHxan1PTIxG"

r = requests.get(query_url)
pprint(r.json())

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
                                        'type': 'image',
                                        'url': 'images/2022/03/11/us/politics/11pol-haley/11pol-haley-filmstrip.jpg',
                                        'width': 190},
                                       {'caption': None,
                                        'credit': None,
                                        'crop_name': 'square640',
                                        'height': 640,
                                        'legacy': {},
                                        'rank': 0,
                                        'subType': 'square640',
                                        'subtype': 'square640',
                                        'type': 'image',
                                        'url': 'images/2022/03/11/us/politics/11pol-haley/11pol-haley-square640.jpg',
                                        'width': 640},
        

In [None]:
import requests

# note that CLIENT_ID refers to 'personal use script' and SECRET_TOKEN to 'token'
auth = requests.auth.HTTPBasicAuth('fDRaSbR_mqpLL69vd7_v5w', 'pN2r-JXCFX2aVGSTtLJ2rWONydUA_w')

# here we pass our login method (password), username, and password
data = {'grant_type': 'password',
        'username': '<Logan_App>',
        'password': '<wallace>'}

# setup our header info, which gives reddit a brief description of our app
headers = {'User-Agent': 'MyBot/0.0.1'}

# send our request for an OAuth token
res = requests.post('https://www.reddit.com/prefs/apps/',
                    auth=auth, data=data, headers=headers)

# convert response to JSON and pull access_token value
TOKEN = res.json()['access_token']

# add authorization to our headers dictionary
headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

# while the token is valid (~2 hours) we just add headers=headers to our requests
requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)

JSONDecodeError: ignored

In [None]:
# Article Search:
# https://api.nytimes.com/svc/search/v2/articlesearch.json?q=<QUERY>&api-key=<APIKEY>
# Use - https://developer.nytimes.com/docs/articlesearch-product/1/routes/articlesearch.json/get to explore API

query = "dog"
begin_date = "20200701"  # YYYYMMDD
filter_query = "\"body:(\"Breed\") AND glocations:(\"World\")\""  # http://www.lucenetutorial.com/lucene-query-syntax.html
page = "0"  # <0-100>
sort = "relevance"  # newest, oldest
query_url = f"https://api.nytimes.com/svc/search/v2/articlesearch.json?" \
            f"q={query}" \
            f"&api-key={apikey}" \
            f"&begin_date={begin_date}" \
            f"&fq={filter_query}" \
            f"&page={page}" \
            f"&sort={sort}"

r = requests.get(query_url)
pprint(r.json())

{'fault': {'detail': {'errorcode': 'oauth.v2.InvalidApiKey'},
           'faultstring': 'Invalid ApiKey'}}
