In [None]:
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline

## Working with Strings in a data frame

In [None]:
got_deaths = pd.read_csv("./data/GoT_Character_Deaths.csv")
got_deaths.sample(10)

In [None]:
name = "Jonathan Arp" 
name.upper()

In [None]:
got_deaths['ALL_CAPS'] = got_deaths['Allegiances'].str.upper()
got_deaths

In [None]:
got_deaths['also allegiances, but shouted'] = got_deaths['Allegiances'].str.upper() + "!!!"
got_deaths.head()
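
The `.str` accessor applies a string method to every value in a column at once. The following is a minimal sketch on a small made-up DataFrame; the name `toy` and its values are illustrative and are not taken from the CSV file.

```python
import pandas as pd

# Made-up rows for illustration; not taken from GoT_Character_Deaths.csv.
toy = pd.DataFrame({"Allegiances": ["House Stark", "Lannister"]})

# Each .str method runs element-wise over the whole column.
toy["ALL_CAPS"] = toy["Allegiances"].str.upper()
toy["is_house"] = toy["Allegiances"].str.startswith("House")
toy["shouted"] = toy["Allegiances"].str.upper() + "!!!"

print(toy)
```

Calling `upper()` on a plain string changes one value; calling it through `.str` changes the whole column without writing a loop.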

# Web Scraping

Web scraping is a large topic. Doing it well requires understanding how websites are built and managed, as well as some fundamentals of HTML. This section covers only the basics of extracting data from websites. 

### `pd.read_html()`

Using the pandas package, you can read the tables embedded in a web page. `pd.read_html()` reads every table it finds on the page. 

The following example extracts the 2017 NBA draft data set from the [Sports Reference](https://www.basketball-reference.com/draft/NBA_2017.html) website.

In [None]:
nba_data_list = pd.read_html("https://www.basketball-reference.com/draft/NBA_2017.html") 
print(type(nba_data_list))
print(len(nba_data_list))
type(nba_data_list[0])

Notice that `read_html()` returns a list. A web page can contain multiple tables, so `read_html()` returns a list of DataFrames, one per table. This page has only one table, so you can access it as the element at index 0. 

In [None]:
nba_df = nba_data_list[0]
nba_df

In [None]:
nba_df.keys()

Information on web pages is not always clean. In this case, you might have noticed that the column names form a multilevel index. You can replace them with flat column names that match the ones shown on the website. 

In [None]:
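# The second MP/PTS/TRB/AST group holds the per-game figures; the '/G' suffix keeps column names unique.
nba_df.columns = ['Rk', 'Pk', 'Tm', 'Player', 'College', 'Yrs', 'G', 'MP', 'PTS', 'TRB', 'AST', 'FG%',
                  '3P%', 'FT%', 'MP/G', 'PTS/G', 'TRB/G', 'AST/G', 'WS', 'WS/48', 'BPM', 'VORP']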

nba_df.head()
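
Instead of typing every column name by hand, a multilevel header can also be flattened programmatically. The following is a minimal sketch on a made-up two-level header; the group labels (`Totals`, `Per Game`) and the helper `flatten` are assumptions for illustration, not a description of what the site returns.

```python
import pandas as pd

# Hypothetical two-level header, similar in shape to what read_html can return.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Totals", "MP"),
    ("Per Game", "MP"),
])
toy = pd.DataFrame([[1, 2133, 26.3]], columns=columns)

def flatten(col):
    """Join the two header levels into one name (hypothetical helper)."""
    group, name = col
    # Drop pandas' "Unnamed: ..." placeholders; prefix everything else with its group.
    return name if group.startswith("Unnamed") else f"{group} {name}"

toy.columns = [flatten(col) for col in toy.columns]
print(list(toy.columns))
```

Prefixing with the group name keeps columns such as `MP` unique when the same statistic appears under more than one group.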

#### Clean the data

Data downloaded from web pages almost always needs cleaning. The following is a simple example of deleting unnecessary data. 

You will notice that data from the internet is **messy**. For example, if you look at rows 28 through 34, you will see that the rows at index 30 and 31 contain data that is not required. On the [website](https://www.basketball-reference.com/draft/NBA_2017.html), the table has a break at that point, so the DataFrame picks up unnecessary rows. 

In [None]:
nba_df.loc[28:34]

In [None]:
# Drop the two rows at index 30 and 31; inplace=True modifies nba_df directly instead of returning a new DataFrame.
nba_df.drop([30,31], axis=0, inplace=True)
nba_df.loc[28:34]
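
Hard-coded indices only work for this one table. A more general approach is to keep the rows whose rank column parses as a number. The sketch below uses a made-up stand-in for the scraped table and assumes the unwanted rows have a non-numeric or missing value in `Rk`.

```python
import pandas as pd

# Made-up stand-in for the scraped table: rows 2 and 3 mimic a mid-table break.
toy = pd.DataFrame({
    "Rk": ["1", "2", None, "Rk", "3"],
    "Player": ["A", "B", "Round 2", "Player", "C"],
})

# errors="coerce" turns anything that is not a number into NaN.
is_data_row = pd.to_numeric(toy["Rk"], errors="coerce").notna()

# Keep only the real data rows and renumber the index from 0.
clean = toy[is_data_row].reset_index(drop=True)
print(clean)
```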

## More on cleaning up data 

We are going to download the top 250 movies from the [IMDb](http://www.imdb.com/chart/top?ref_=nv_wl_img_3) list. 

We need to clean the data and remove unnecessary rows and columns like before. But there's more we want to do. 

Notice that the title actually has the release year of the movie in it. That's not helpful. It would be better to have a separate column for the year, which would be very useful for our data analysis goals.

For example, we could ask which movie released in 2014 has the highest IMDb rating, or which year had the highest average rating.

In [None]:
movie_df = pd.read_html("https://www.imdb.com/chart/top?ref_=nv_wl_img_3")[0]
movie_df.head()

### Dropping unnecessary columns
Let's drop the columns that have no useful data. First we pass in a `list` of columns. Remember that the default is to delete rows, so we add `axis=1` to tell pandas we are dropping columns. Lastly, we want the changes to persist, so we add `inplace=True`.

In [None]:
movie_df.drop(["Unnamed: 0", "Unnamed: 4", "Your Rating"], axis=1, inplace=True)
movie_df.head()

In [None]:
# using the .str accessor, slice the year out of the title and put it in a column called 'year'
movie_df['year'] = movie_df['Rank & Title'].str[-5:-1]
movie_df['Rank & Title'] = movie_df['Rank & Title'].str[0:-6]
movie_df.head()

In [None]:
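# raw strings keep the regex backslashes intact; \. matches the literal dot after the rank
# expand=False returns a Series (one capture group) instead of a one-column DataFrame
movie_df['ranking'] = movie_df['Rank & Title'].str.extract(r'(\d{1,4})\.\s', expand=False)
movie_df['title'] = movie_df['Rank & Title'].str.extract(r'\d{1,4}\.\s+(.*)', expand=False)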
movie_df.drop(['Rank & Title'], axis=1, inplace=True)
movie_df.head()
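
The slicing and the two `extract` calls can also be combined into a single regular expression with named groups, where each group becomes its own column. This is a minimal sketch on made-up strings in the "rank. title (year)" shape; the exact spacing of the real scraped text is an assumption.

```python
import pandas as pd

# Made-up titles in the "rank. title (year)" shape described above.
toy = pd.DataFrame({"Rank & Title": [
    "1.  The Shawshank Redemption  (1994)",
    "2.  The Godfather  (1972)",
]})

# Each (?P<name>...) group becomes a column with that name.
pattern = r"^(?P<ranking>\d{1,4})\.\s+(?P<title>.*?)\s+\((?P<year>\d{4})\)$"
parts = toy["Rank & Title"].str.extract(pattern)

# extract() returns strings, so convert the numeric pieces explicitly.
parts["ranking"] = parts["ranking"].astype(int)
parts["year"] = parts["year"].astype(int)
print(parts)
```

Storing the year as an integer allows comparisons such as `parts['year'] == 2014` without quoting the value.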

In [None]:
movie_14_df = movie_df[movie_df['year'] == '2014']
df_sorted = movie_14_df.sort_values('IMDb Rating', ascending=False)
df_sorted

In [None]:
movie_14_df[ movie_14_df['IMDb Rating'] == movie_14_df['IMDb Rating'].max() ]

In [None]:
# average the numeric columns for each year; numeric_only=True skips text columns such as the title
movies_by_year = movie_df.groupby('year').mean(numeric_only=True)
# Highest ranked year
movies_by_year.loc[ movies_by_year['IMDb Rating'].idxmax()]

In [None]:
movies_by_year
# movies_by_year['IMDb Rating'].idxmax()
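
To see what `groupby(...).mean()` and `idxmax()` do, here is a minimal sketch on a made-up table small enough to check by hand. The ratings are invented for illustration.

```python
import pandas as pd

# Invented ratings, chosen so the averages are easy to verify by hand.
toy = pd.DataFrame({
    "year": [1994, 1994, 1972, 2014],
    "title": ["A", "B", "C", "D"],
    "IMDb Rating": [9.2, 8.8, 9.1, 8.5],
})

# One average rating per year: 1972 -> 9.1, 1994 -> 9.0, 2014 -> 8.5
mean_rating = toy.groupby("year")["IMDb Rating"].mean()

# idxmax() returns the index label (the year) of the largest average.
best_year = mean_rating.idxmax()
print(best_year)
```

Selecting the `IMDb Rating` column before calling `mean()` avoids averaging columns that are not meaningful as numbers.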

# Packages for web scraping 

* urllib
* requests
* **BeautifulSoup**
* mechanize

Using these packages requires some fundamentals of HTML, the language used to display web pages in the browser. 

In [None]:
import urllib
import requests
from bs4 import BeautifulSoup

In [None]:
req = requests.get("https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals")
page = req.text
page

In [None]:
page_soup = BeautifulSoup(page, 'html.parser')
page_soup

You can print the actual webpage and its contents. 

**Warning**: The contents of a webpage are messy and may be hard to read at first. However, if you want to scrape a website, you will have to be patient and look through the contents to find the information you need. 

In [None]:
print(page_soup.prettify())

In [None]:
page_soup.title

In [None]:
page_soup.title.string

### Searching in the webpage

You can programmatically search through a webpage to find the tables it contains. You can do that with the **`find_all()`** method, which returns every element with the given tag name. 

In [None]:
states_table = page_soup.find_all("table")
print(len(states_table))
states_table
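
Once a table has been found, its rows and cells can be read with further `find_all()` calls. The sketch below runs on a tiny hand-written page rather than the Wikipedia page, whose real markup is larger and may be structured differently.

```python
from bs4 import BeautifulSoup

# Tiny hand-written page for illustration; not the real Wikipedia markup.
html = """
<html><body>
<table>
  <tr><th>State</th><th>Capital</th></tr>
  <tr><td>Michigan</td><td>Lansing</td></tr>
  <tr><td>Illinois</td><td>Springfield</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]

rows = []
for tr in table.find_all("tr"):
    # Header cells are <th>, data cells are <td>; collect the text of both.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

The first inner list holds the header, so `pd.DataFrame(rows[1:], columns=rows[0])` would turn the result into a DataFrame.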

# Web Scraping through an Application Programming Interface (API)

Many websites provide APIs. You can use these APIs to retrieve data from services such as Twitter, Google Trends, etc. 

In this section, we will use a simple API from [Open Notify](http://open-notify.org/), an open source project that exposes some of NASA's data, to retrieve data about the International Space Station (ISS). 

Some of the content presented here is based on [dataquest](https://www.dataquest.io/blog/python-api-tutorial/). 

#### Current ISS position

In [None]:
import requests
response = requests.get("http://api.open-notify.org/iss-now.json")

print(response.status_code)
print(response.content)

A request can return various status codes. [This list](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) describes them in more detail. 
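
As a rough guide, codes in the 200s mean success, the 400s mean the request was wrong, and the 500s mean the server failed. The sketch below turns a code into a readable label using the standard library; `describe_status` is a hypothetical helper, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label for an HTTP status code (hypothetical helper)."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the standard ones known to Python.
        phrase = "Unknown"

    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"

print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

With a real response, `describe_status(response.status_code)` would give the same kind of label.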

In [None]:
response = requests.get("http://api.open-notify.org/iss-now.json")
pd.read_json(response.content)

#### Current Number of People In Space

In [None]:
response = requests.get("http://api.open-notify.org/astros.json")
pd.read_json(response.content)

In [None]:
response.content
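
The raw content is JSON text, which can also be parsed step by step instead of handing it to `pd.read_json()`. The sketch below uses a hand-written sample string rather than a live request; its shape (`message`, `number`, and a `people` list of `name`/`craft` entries) is an assumption about the API's response, and the names are placeholders.

```python
import json
import pandas as pd

# Hand-written sample; the shape is assumed and the names are placeholders.
sample = """
{"message": "success",
 "number": 2,
 "people": [{"name": "Person A", "craft": "ISS"},
            {"name": "Person B", "craft": "ISS"}]}
"""

# json.loads turns the JSON text into ordinary Python dicts and lists.
payload = json.loads(sample)

# A list of dicts maps directly to rows and columns.
people = pd.DataFrame(payload["people"])
print(payload["number"])
print(people)
```

With a live response, `response.json()` would return the same kind of dictionary as `json.loads(sample)`.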

# Google Maps API

You need to install the `googlemaps` package in order to use this. 

Open `Anaconda Prompt` on your computer and type `pip install --user googlemaps`. This installs the `googlemaps` package that we use here. 

In [None]:
import googlemaps

from datetime import datetime

### Google API key

Ideally, you would create this key from your Google API dashboard after logging in with your Google account. I have provided a key for a dummy account. It comes with its own restrictions. You may want to create a key for your own account. 

**NOTE**: The key used in the cell below might be disabled after the class. You can create your own key using the link [here](https://support.google.com/googleapi/answer/6158862?hl=en). 

In [None]:
gmaps = googlemaps.Client(key='AIzaSyDCdQCVKWQNNhERNEmuufTwmhDeDszV1ws')
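
If you create your own key, one common option is to keep it out of the notebook and read it from an environment variable. This is a minimal sketch; the variable name `GOOGLE_MAPS_API_KEY` and the helper `load_api_key` are assumptions chosen for illustration, not something the `googlemaps` package requires.

```python
import os

def load_api_key(var_name="GOOGLE_MAPS_API_KEY"):
    """Read an API key from an environment variable (hypothetical helper)."""
    key = os.environ.get(var_name)
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

With the variable set, the client could then be created with `googlemaps.Client(key=load_api_key())`.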

In [None]:
cities_zip = pd.read_csv("./data/uscities_zip.csv", index_col = ['city', 'state_id'])

In [None]:
cities_zip.head()

In [None]:
datetime_object = datetime.strptime('Apr 30 2022  3:00PM', '%b %d %Y %I:%M%p')
datetime_object
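
The format string tells `strptime` how to read each piece of the text: `%b` is the abbreviated month name, `%d` the day, `%Y` the four-digit year, `%I` the hour on a 12-hour clock, `%M` the minutes, and `%p` AM or PM. A short sketch of parsing and then formatting the same value, assuming an English locale for the month name:

```python
from datetime import datetime

fmt = "%b %d %Y %I:%M%p"

# Text -> datetime object
parsed = datetime.strptime("Apr 30 2022  3:00PM", fmt)

# datetime object -> text, this time on a 24-hour clock
as_text = parsed.strftime("%Y-%m-%d %H:%M")
print(as_text)  # 2022-04-30 15:00
```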

In [None]:
json_text = gmaps.distance_matrix((cities_zip.loc['Detroit', 'MI']['lat'], 
                                   cities_zip.loc['Detroit', 'MI']['lng']),
                                  
                                  (cities_zip.loc['Chicago', 'IL']['lat'], 
                                   cities_zip.loc['Chicago', 'IL']['lng']), 
                                    
                                  departure_time= datetime_object)

In [None]:
json_text
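
The result is a nested dictionary, so the distance and travel time have to be dug out of it. The sketch below works on a hand-written sample in the shape of a Distance Matrix response; the numbers are made up, and `first_leg` is a hypothetical helper. Distances are reported in meters and durations in seconds.

```python
# Hand-written sample in the shape of a Distance Matrix response; numbers are made up.
sample = {
    "origin_addresses": ["Detroit, MI, USA"],
    "destination_addresses": ["Chicago, IL, USA"],
    "rows": [{"elements": [{
        "status": "OK",
        "distance": {"text": "455 km", "value": 455000},
        "duration": {"text": "4 hours 20 mins", "value": 15600},
    }]}],
    "status": "OK",
}

def first_leg(result):
    """Return (kilometers, hours) for the first origin/destination pair (hypothetical helper)."""
    element = result["rows"][0]["elements"][0]
    if element["status"] != "OK":
        # No route was found for this pair.
        return None
    km = element["distance"]["value"] / 1000      # meters -> kilometers
    hours = element["duration"]["value"] / 3600   # seconds -> hours
    return km, hours

print(first_leg(sample))
```

Calling `first_leg(json_text)` on the real result would apply the same lookup, assuming the request succeeded.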