# Data module class 2
Reading documentation: Pandas and BeautifulSoup

In [21]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [86]:
# download and import BeautifulSoup if you need to
# !pip install beautifulsoup4

## Pandas
### Terminology reference
#### Data structures
##### 1-dimensional data (create Series)

|pandas abbreviation|definition|example|
|---|---|---|
|dict|Python dictionary|`{'a': 'value', 'b': 'value'}`|
|ndarray|NumPy N-dimensional array (1-dimensional for a Series)|`np.array([0, 1, 2, 3])`|
|scalar|Single value|`100`|
|list|Python list|`[0, 1, 2, 3]`|

##### 2-dimensional data (create DataFrames)

|pandas term|example|
|---|---|
|ndarray|`[[0, 1, 2, 3], [4, 5, 6, 7]]`|
|dict of ndarrays|`{'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]}`|
|list of dicts|`[{'id': 1, 'info': 'text'}, {'id': 2, 'info': 'more text'}]`|

#### How do these look when loaded in pandas?
[Taken from the Pandas User Guide](https://pandas.pydata.org/docs/user_guide/dsintro.html)

In [55]:
pd.Series({'a': 'value', 'b': 'value'})

a    value
b    value
dtype: object

In [56]:
pd.Series([0, 1, 2, 3])

0    0
1    1
2    2
3    3
dtype: int64

In [57]:
pd.Series(5)

0    5
dtype: int64

In [80]:
pd.DataFrame([{'id': 1, 'info': 'text'}, {'id': 2, 'info': 'more text'}])

   id       info
0   1       text
1   2  more text


In [81]:
pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7]])

   0  1  2  3
0  0  1  2  3
1  4  5  6  7
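The 2-dimensional table above also lists a dict of array-like values, which the cells above do not load. A minimal sketch, reusing the example values from that table:

```python
import pandas as pd

# Each dict key becomes a column name; each list becomes that column's values.
df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]})
print(df)
```

All lists need the same length, since each one fills a full column.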


#### Other terms
[See pd.to_datetime() as an example](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime)

- parameters
    - The inputs that a function accepts
- args
    - Required arguments: values the function needs in order to run
    - e.g. the data you want to convert
- kwargs (even though pandas does not label them as such)
    - Keyword arguments: optional arguments that a function does not need in order to run, but that change its behavior from the default. They are called "keyword" arguments because you pass them by parameter name
    - e.g. errors='raise'
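To make the difference concrete, here is a small sketch using `pd.to_datetime`. The `dates` list is made-up sample data; `errors='coerce'` is one of the documented alternatives to the default `errors='raise'`.

```python
import pandas as pd

# Made-up sample data: one valid date string and one invalid one.
dates = ['2021-01-15', 'not a date']

# Required argument only: the default errors='raise' raises an exception
# when a value cannot be parsed.
try:
    pd.to_datetime(dates)
except ValueError as exc:
    print('default behavior raised:', type(exc).__name__)

# Same required argument plus a keyword argument: unparseable values
# become NaT (pandas' missing-value marker for datetimes).
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed)
```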

### 1. Let's practice input/output with Pandas with the following links.
Use the [IO Tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) section of the pandas documentation to load these datasets.

- [Avengers Wikia data - FiveThirtyEight](https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv) | [Documentation here](https://github.com/fivethirtyeight/data/tree/master/avengers)
- [List of sovereign states - Wikipedia](https://en.wikipedia.org/wiki/List_of_sovereign_states)
- [Homeless housing - LA Times](https://raw.githubusercontent.com/kyleykim/R_Scripts/master/la-me-ln-hhh-unequal/revised_data/master_data_geocoded.csv) | [Documentation](https://github.com/kyleykim/R_Scripts/tree/master/la-me-ln-hhh-unequal)
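As a starting point, here is a hedged sketch of the two readers these links are likely to need. The in-memory CSV and its column names (`name`, `group`, `year`) are invented stand-ins so the cell runs offline; the commented lines show the same calls pointed at the links above.

```python
import io
import pandas as pd

# Invented stand-in for a remote CSV file. read_csv accepts a URL,
# a file path, or a file-like object such as this one.
sample_csv = io.StringIO(
    "name,group,year\n"
    "Hero A,good,1962\n"
    "Hero B,,1941\n"
)
sample_df = pd.read_csv(sample_csv)
print(sample_df.head())

# With a network connection, pass the raw CSV URL directly:
# avengers = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/comic-characters/marvel-wikia-data.csv')

# read_html returns a list of DataFrames, one per <table> it finds,
# and needs an HTML parser such as lxml installed:
# tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_sovereign_states')
# countries = tables[0]
```

Which entry of `tables` holds the data you want depends on the page, so inspect the list before picking an index.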

### 2. Let's practice working with missing data and selecting these values
#### For each DataFrame, either select all the missing values of one column or select a unique categorical value.
The [Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) section of the pandas documentation will help.
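A minimal sketch of both kinds of selection, on a made-up DataFrame (the `name` and `group` columns are hypothetical, not columns from the datasets above):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in the 'group' column.
df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'group': ['good', 'bad', np.nan, 'good'],
})

# isna() builds a boolean mask that is True where the value is missing;
# indexing with the mask keeps only those rows.
missing_group = df[df['group'].isna()]

# Comparing a column to a value builds a mask for one categorical value.
good_only = df.loc[df['group'] == 'good']

print(missing_group)
print(good_only)
```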

#### a. Avengers

#### b. Countries

#### c. LA homeless housing

### 3. Let's practice cleaning with intent

#### Use each of the three datasets you loaded to come up with a question you want to answer with the data
##### Tips
- Show the list of columns, the column types, and the number of null values
- Find unique values to look at categorical data
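The tips above could look like this in practice. This is a sketch on invented data; swap in your own DataFrame and column names.

```python
import numpy as np
import pandas as pd

# Invented data standing in for one of the loaded datasets.
df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'group': ['good', 'bad', np.nan, 'good'],
    'year': [1962, 1941, 1975, 1990],
})

# Column names, dtypes, and non-null counts in one printout.
df.info()

# Number of missing values per column.
null_counts = df.isna().sum()
print(null_counts)

# Distinct values of a categorical column, and how often each appears
# (dropna=False keeps missing values in the tally).
unique_groups = df['group'].unique()
group_counts = df['group'].value_counts(dropna=False)
print(unique_groups)
print(group_counts)
```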

#### a. Avengers
##### Question
- _your question here_

##### What cleaning do I need to do to answer the question
- 
- 
- 

In [1]:
# show the dataframe info here to get you started 

#### b. Countries
##### Question
- _your question here_

##### What cleaning do I need to do to answer the question
- 
- 
- 

#### c. LA homeless housing
##### Question
- _your question here_

##### What cleaning do I need to do to answer the question
- 
- 
- 

Take a look at the [LA Times'](https://github.com/datadesk/notebooks) or [FiveThirtyEight's](https://github.com/fivethirtyeight/data) repositories for more practice.

## BeautifulSoup
[BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [2]:
# load in the HTML and format for BS

In [3]:
# finds the title tag

In [4]:
# grab the first a tag

In [5]:
# finds all a tags

In [6]:
# find all elements with the class "mw-jump-link"
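One way the cells above might be filled in, sketched against a small invented HTML string so it runs without a network connection. For a live page you would fetch the HTML first, for example with `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for a downloaded page.
html = """
<html><head><title>Sample page</title></head>
<body>
<a class="mw-jump-link" href="#nav">Jump to navigation</a>
<a href="/wiki/Example">Example</a>
</body></html>
"""

# Parse the HTML with Python's built-in parser.
soup = BeautifulSoup(html, 'html.parser')

title_tag = soup.title                                 # the <title> tag
first_link = soup.find('a')                            # the first <a> tag
all_links = soup.find_all('a')                         # every <a> tag
jump_links = soup.find_all(class_='mw-jump-link')      # match by class name

print(title_tag.string)
print(first_link['href'])
print(len(all_links), len(jump_links))
```

`class_` has a trailing underscore because `class` is a reserved word in Python.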

### Traverse the DOM

In [74]:
# we know the table we want is the first table in the DOM
# then we want to read tr tags as rows (groups of cells) and td tags as cells

In [7]:
# find where the data you want resides (a tag, class name, etc)

In [8]:
# find_all tr

In [9]:
# separate the first tr tag row for the header

In [10]:
# for each tr, find its tds, get the text inside each td, then save it to a new list
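The traversal steps above can be sketched as follows. The table is an invented example; on a real page the header cells may be `th` or `td`, so this sketch accepts both.

```python
from bs4 import BeautifulSoup

# Invented table standing in for the first table on the page.
html = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, here the first table in the DOM.
table = soup.find('table')

# Every tr is one row of the table.
rows = table.find_all('tr')

# Separate the first row as the header.
header = [cell.get_text(strip=True) for cell in rows[0].find_all(['th', 'td'])]

# For each remaining row, collect the text of each td into a list.
data = []
for row in rows[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    data.append(cells)

print(header)
print(data)
```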

### We can do more cleaning here
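As one possible next step, the scraped rows can be loaded into a DataFrame and tidied with string methods. The sample rows and the bracketed footnote markers such as `[a]` are assumptions about what scraped text might contain, not output from the page above.

```python
import pandas as pd

# Hypothetical scraped output: a header row plus data rows that still
# carry footnote markers and stray whitespace.
header = ['Country', 'Capital']
data = [['France[a]', ' Paris'], ['Japan', 'Tokyo[b] ']]

df = pd.DataFrame(data, columns=header)

for col in df.columns:
    df[col] = (
        df[col]
        .str.replace(r'\[[^\]]*\]', '', regex=True)  # drop [a]-style markers
        .str.strip()                                  # trim leading/trailing spaces
    )

print(df)
```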