# Data module class 2
Reading documentation: Pandas and BeautifulSoup

In [11]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [13]:
# download and install BeautifulSoup if you need to
# !pip install beautifulsoup4

## Pandas
### Terminology reference
#### Data structures
##### 1-dimensional data (create Series)

|pandas abbreviation|definition|example|
|---|---|---|
|dict|Python dictionary|`{'a': 'value', 'b': 'value'}`|
|ndarray|N-dimensional NumPy array (can be 1- or 2-dimensional)|`np.array([0, 1, 2, 3])`|
|scalar|Single value|`100`|
|list|Python list|`[0, 1, 2, 3]`|

##### 2-dimensional data (create DataFrames)

|pandas term|example|
|---|---|
|ndarray|`np.array([[0, 1, 2, 3], [4, 5, 6, 7]])`|
|dict of ndarrays or lists|`{'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]}`|
|list of dicts|`[{'id': 1, 'info': 'text'}, {'id': 2, 'info': 'more text'}]`|

#### How do these look when loaded in pandas?
[Taken from the Pandas User Guide](https://pandas.pydata.org/docs/user_guide/dsintro.html)

In [14]:
pd.Series({'a': 'value', 'b': 'value'})

a    value
b    value
dtype: object

In [56]:
pd.Series([0, 1, 2, 3])

0    0
1    1
2    2
3    3
dtype: int64

In [57]:
pd.Series(5)

0    5
dtype: int64

In [18]:
pd.DataFrame([{'id': 1, 'info': 'text'}, {'id': 2, 'info': 'more text'}])

#### Other terms
[See pd.to_datetime() as an example](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime)

#### Parameters: information that a function accepts
- args
    - Required arguments: the things the function needs in order to run
    - e.g. the data for your DataFrame
- kwargs (even though pandas does not label them as such)
    - Keyword arguments: optional arguments that a function does not need in order to run, but that make it behave differently from its default. They are called "keyword" arguments because you pass them by name
    - e.g. `errors='raise'`
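
Here is a quick sketch of the difference, using `pd.to_datetime()` from the link above. The sample strings are made up for illustration. `errors` is the keyword argument, and `'coerce'` is one of its documented options alongside the default `'raise'`.

```python
import pandas as pd

# made-up sample data: one valid date string and one that cannot be parsed
dates = ['2020-01-01', 'not a date']

# required argument only: errors defaults to 'raise', so the bad value raises
try:
    pd.to_datetime(dates)
    raised = False
except ValueError:
    raised = True

# the keyword argument changes the default behavior: bad values become NaT
parsed = pd.to_datetime(dates, errors='coerce')
print(raised)
print(parsed)
```

The data is the only thing the function needs. The keyword argument is optional and only changes what happens when parsing fails.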

### 1. Let's practice input/output with Pandas with the following links.
Use the [IO Tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) section of the pandas documentation to load these datasets

- [Avengers Wikia data - FiveThirtyEight](https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv) | [Documentation here](https://github.com/fivethirtyeight/data/tree/master/avengers)
- [List of sovereign states - Wikipedia](https://en.wikipedia.org/wiki/List_of_sovereign_states)
- [Homeless housing - LA Times](https://raw.githubusercontent.com/kyleykim/R_Scripts/master/la-me-ln-hhh-unequal/revised_data/master_data_geocoded.csv) | [Documentation](https://github.com/kyleykim/R_Scripts/tree/master/la-me-ln-hhh-unequal)
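
As a starting point, here is a minimal sketch of `pd.read_csv()`. It reads a small made-up CSV string so that it runs without a network connection; `csv_text` and `df_sample` are names invented for this example. The commented lines show how the same idea would likely apply to the links above, assuming the URLs are reachable from your machine.

```python
import io
import pandas as pd

# made-up stand-in for a remote CSV file, so the sketch runs offline
csv_text = "name,HAIR\nCharacter A,Brown Hair\nCharacter B,Blond Hair\n"
df_sample = pd.read_csv(io.StringIO(csv_text))
print(df_sample)

# read_csv also accepts a URL in place of a file path or buffer, e.g.:
# df_avengers = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv')

# read_html returns a list of DataFrames, one per <table> it finds on the page.
# It needs an HTML parser installed (lxml, or beautifulsoup4 with html5lib).
# tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_sovereign_states')
# inspect len(tables) and each tables[i].head() to find the one you want
```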

### 2. Let's practice working with missing data and selecting these values
#### For each DataFrame, either select all the missing values of one column or select a unique categorical value.
The [Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) section of the pandas documentation will help.
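
Both kinds of selection use a boolean mask. The sketch below uses a tiny made-up DataFrame (the names and values are placeholders, not rows from the datasets above) to show the pattern you can adapt for each exercise.

```python
import numpy as np
import pandas as pd

# made-up sample with one missing value in the HAIR column
df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'HAIR': ['Blond Hair', np.nan, 'Bald'],
})

# select the rows where one column is missing
missing_hair = df[df['HAIR'].isna()]

# select the rows that match one categorical value
bald = df.loc[df['HAIR'] == 'Bald']

print(missing_hair)
print(bald)
```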

#### a. Avengers

#### b. Countries

#### c. LA homeless housing

### 3. Let's practice cleaning with intent

#### Use each of the three datasets you loaded to come up with a question you want to answer with the data
##### Tips
- Show the list of columns, their types, and how many null values each one has
- Find unique values to look at categorical data
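
A rough sketch of both tips on a made-up DataFrame (the values are placeholders). `isna().sum()` and `value_counts(dropna=False)` are one way to get null counts and category counts; `info()` and `unique()` work too.

```python
import numpy as np
import pandas as pd

# made-up sample data
df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'HAIR': ['Blond Hair', np.nan, 'Bald', 'Blond Hair'],
})

# columns, dtypes, and non-null counts in one summary
df.info()

# number of null values per column
null_counts = df.isna().sum()

# how often each category appears, keeping missing values as their own row
hair_counts = df['HAIR'].value_counts(dropna=False)
print(null_counts)
print(hair_counts)
```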

#### a. Avengers
##### Question
- How many Avengers have blonde hair? 

##### What cleaning do I need to do to answer the question
- Look up what options of hair color are available (categories)
- ID different spellings
- Check count

In [42]:
# show the dataframe info here to get you started 
df_avengers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16376 entries, 0 to 16375
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   page_id           16376 non-null  int64  
 1   name              16376 non-null  object 
 2   urlslug           16376 non-null  object 
 3   ID                12606 non-null  object 
 4   ALIGN             13564 non-null  object 
 5   EYE               6609 non-null   object 
 6   HAIR              12112 non-null  object 
 7   SEX               15522 non-null  object 
 8   GSM               90 non-null     object 
 9   ALIVE             16373 non-null  object 
 10  APPEARANCES       15280 non-null  float64
 11  FIRST APPEARANCE  15561 non-null  object 
 12  Year              15561 non-null  float64
dtypes: float64(2), int64(1), object(10)
memory usage: 1.6+ MB


In [46]:
# Look up what options of hair color are available (categories)
pd.DataFrame(df_avengers['HAIR'].unique())

| |0|
|---|---|
|0|Brown Hair|
|1|White Hair|
|2|Black Hair|
|3|Blond Hair|
|4|No Hair|
|5|Blue Hair|
|6|Red Hair|
|7|Bald|
|8|Auburn Hair|
|9|Grey Hair|


In [47]:
# ID different spellings
# Let's create a broader category for hair
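
One possible way to fill in this step, sketched on a made-up Series rather than the real column. `'Strawberry Blond Hair'` is an invented value used only to show how a substring match catches spelling variants, and merging `'Bald'` into `'No Hair'` is an assumption about which categories you would want to combine.

```python
import pandas as pd

# made-up sample values, including a missing one
hair = pd.Series(['Blond Hair', 'Strawberry Blond Hair', 'Bald',
                  'No Hair', None, 'Brown Hair'])

# catch every spelling that contains "blond", ignoring case and missing values
is_blond = hair.str.contains('blond', case=False, na=False)
blond_count = int(is_blond.sum())

# fold two labels that mean the same thing into one broader category
hair_group = hair.replace({'Bald': 'No Hair'})
no_hair_count = int((hair_group == 'No Hair').sum())
print(blond_count, no_hair_count)
```

An exact match such as `== 'Blond Hair'` would miss any variant spellings, which is why it is worth checking the categories first.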

In [48]:
# Check count
len(df_avengers[df_avengers['HAIR'] == 'Blond Hair'])

1582

#### b. Countries
##### Question
- _your question here_

##### What cleaning do I need to do to answer the question
- 
- 
- 

#### c. LA homeless housing
##### Question
- _your question here_

##### What cleaning do I need to do to answer the question
- 
- 
- 

Take a look at the [LA Times'](https://github.com/datadesk/notebooks) or [FiveThirtyEight's](https://github.com/fivethirtyeight/data) repositories for more practice

## BeautifulSoup
[BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [50]:
# load in the HTML and format for BS
sp_url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

In [51]:
sp_r = requests.get(sp_url)

In [58]:
# name the parser explicitly so BeautifulSoup does not have to guess one
sp_bs = BeautifulSoup(sp_r.text, 'html.parser')

In [60]:
# sp_bs

In [61]:
# finds the title tag
sp_bs.title

<title>List of S&amp;P 500 companies - Wikipedia</title>

In [62]:
# grab the first a tag
sp_bs.a

<a id="top"></a>

In [65]:
# finds all a tags
len(sp_bs.find_all('a'))

3562

In [66]:
# find all elements with the class "mw-jump-link"
sp_bs.find_all(class_='mw-jump-link')

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>]

### Traverse the DOM

In [71]:
# we know the table we want is the first table in the DOM
sp_tables = sp_bs.find_all('table')
sp_tables = sp_tables[0]

In [69]:
len(sp_tables)

2

In [72]:
# then we want to read tr tags as groups of cells in a row and td tags as cells
sp_trs = sp_tables.find_all('tr')

In [86]:
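sp_list = []
# skip the header row, then build one list of cell values per table row
for tr in sp_trs[1:]:
    tds = tr.find_all('td')
    tr_list = []
    for i, td in enumerate(tds):
        if i == 2:
            # the third cell holds a link, so keep its href instead of its text
            tr_list.append(td.find('a')['href'])
        else:
            tr_list.append(td.text)
    sp_list.append(tr_list)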

In [87]:
sp_df = pd.DataFrame(sp_list)

In [88]:
# scratch notes: a row could be stored as a dict keyed by name or as a plain list
# [{'url': 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=M'}]
# [0, 1, 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=M']
sp_df

| |0|1|2|3|4|5|6|7|8|
|---|---|---|---|---|---|---|---|---|---|
|0|MMM\n|3M|https://www.sec.gov/cgi-bin/browse-edgar?CIK=M...|Industrials|Industrial Conglomerates|Saint Paul, Minnesota|1976-08-09|0000066740|1902\n|
|1|ABT\n|Abbott Laboratories|https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...|Health Care|Health Care Equipment|North Chicago, Illinois|1964-03-31|0000001800|1888\n|
|2|ABBV\n|AbbVie|https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...|Health Care|Pharmaceuticals|North Chicago, Illinois|2012-12-31|0001551152|2013 (1888)\n|
|3|ABMD\n|Abiomed|https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...|Health Care|Health Care Equipment|Danvers, Massachusetts|2018-05-31|0000815094|1981\n|
|4|ACN\n|Accenture|https://www.sec.gov/cgi-bin/browse-edgar?CIK=A...|Information Technology|IT Consulting & Other Services|Dublin, Ireland|2011-07-06|0001467373|1989\n|
|...|...|...|...|...|...|...|...|...|...|
|500|YUM\n|Yum! Brands|https://www.sec.gov/cgi-bin/browse-edgar?CIK=Y...|Consumer Discretionary|Restaurants|Louisville, Kentucky|1997-10-06|0001041061\n|1997\n|
|501|ZBRA\n|Zebra Technologies|https://www.sec.gov/cgi-bin/browse-edgar?CIK=Z...|Information Technology|Electronic Equipment & Instruments|Lincolnshire, Illinois|2019-12-23|0000877212\n|1969\n|
|502|ZBH\n|Zimmer Biomet|https://www.sec.gov/cgi-bin/browse-edgar?CIK=Z...|Health Care|Health Care Equipment|Warsaw, Indiana|2001-08-07|0001136869\n|1927\n|
|503|ZION\n|Zions Bancorp|https://www.sec.gov/cgi-bin/browse-edgar?CIK=Z...|Financials|Regional Banks|Salt Lake City, Utah|2001-06-22|0000109380\n|1873\n|


In [75]:
# sp_trs[0]

In [7]:
# find where the data you want resides (a tag, class name, etc)

In [8]:
# find_all tr

In [9]:
# separate the first tr tag row for the header

In [10]:
# for each tr, find tds then for each td get text inside, then save to new array
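
The four steps in the cells above can be sketched end to end on a small inline HTML snippet. The snippet is made up so the example runs without downloading anything; on a real page you would pass the downloaded HTML instead, and the table may need a more specific selector than `find('table')`.

```python
from bs4 import BeautifulSoup
import pandas as pd

# made-up HTML standing in for a downloaded page
html = """
<table>
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M</td></tr>
  <tr><td>ABT</td><td>Abbott Laboratories</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# 1. find where the data you want resides
table = soup.find('table')

# 2. find all tr tags
rows = table.find_all('tr')

# 3. separate the first tr for the header
header = [th.get_text(strip=True) for th in rows[0].find_all('th')]

# 4. for each remaining tr, collect the text of every td
records = [[td.get_text(strip=True) for td in tr.find_all('td')]
           for tr in rows[1:]]

df = pd.DataFrame(records, columns=header)
print(df)
```

Using `get_text(strip=True)` drops the surrounding whitespace as each cell is read, which is one way to avoid stray newline characters in the result.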

### We can do more cleaning here
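
For example, the scraped cells still carry trailing `\n` characters and the columns have no names. A minimal sketch of that cleanup follows, using a few values copied from the output above. The column names `symbol`, `security`, and `founded` are assumptions chosen for this example, and taking the first four-digit number as the founding year is just one possible rule.

```python
import pandas as pd

# a few cells as they appear in the scraped output, trailing newlines included
raw = pd.DataFrame([
    ['MMM\n', '3M', '1902\n'],
    ['ABT\n', 'Abbott Laboratories', '1888\n'],
    ['ABBV\n', 'AbbVie', '2013 (1888)\n'],
])

# assumed column names, for illustration only
raw.columns = ['symbol', 'security', 'founded']

# strip leading and trailing whitespace from every text column
clean = raw.apply(lambda col: col.str.strip())

# keep the first four-digit year and store it as an integer
clean['founded'] = clean['founded'].str.extract(r'(\d{4})', expand=False).astype(int)
print(clean)
```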