# Python Intermediate - Day 2
---
# Using Pandas
- `pandas` is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool.
- `pandas` is built on top of `numpy` and is therefore often used together with `numpy`.

**Import before you use**:
```
import numpy as np
import pandas as pd
```

# Check your Pandas version
Syntax:
```
print(pd.__version__)
```

# Some handy Python features

**Creating a list**

Instead of typing `l = ['A', 'B', 'C', 'D', 'E']`, you can write either of the following:
```
l = list('ABCDE')
l = 'A B C D E'.split()
```

**Generating a range of dates**:
```
dates = pd.date_range('20210101', periods=7) 
type(dates) # returns pandas.core.indexes.datetimes.DatetimeIndex
```
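
To see what `pd.date_range()` produces, the sketch below inspects the result and uses it as the row index of a small table. The `Visitors` column and its numbers are made up for illustration.

```python
import pandas as pd

# Seven consecutive days starting on 1 January 2021 (daily frequency is the default)
dates = pd.date_range('20210101', periods=7)
print(dates[0])   # first timestamp in the range
print(dates[-1])  # last timestamp in the range

# A DatetimeIndex is commonly used as the row index of a DataFrame.
# 'Visitors' is a hypothetical column with made-up values.
visits = pd.DataFrame({'Visitors': [5, 8, 6, 9, 12, 20, 18]}, index=dates)

# Look up a single cell by its date label and column name
third_day = visits.loc[pd.Timestamp('2021-01-03'), 'Visitors']
print(third_day)
```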

# Pandas `DataFrame`
A DataFrame is a two-dimensional data structure, like a table with rows and columns.

**To declare a dataframe**:
```
df = pd.DataFrame({
    "Name": ["Andy", "Ben", "Cathy", "Debra"],
    "Age": [20, 22, 23, 22],
    "Sex": ["male", "male", "female", "female"],
    "Year": [1, 3, 4, 3]
})
```

**Check the type**
```
print(type(df)) # prints <class 'pandas.core.frame.DataFrame'>
```

**In real applications**
- DataFrames are seldom declared in code as shown above.
- Instead, they are usually imported from external sources.
- E.g.: CSV, Excel, JSON
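
As a rough illustration of loading from an external source, the sketch below reads CSV text into a DataFrame. To stay self-contained it uses an in-memory string (an assumption made for this example) where a real project would pass a file path.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would live in a file on disk
csv_text = """Name,Age,Sex,Year
Andy,20,male,1
Ben,22,male,3
Cathy,23,female,4
Debra,22,female,3
"""

# pd.read_csv accepts a file path or any file-like object
students = pd.read_csv(io.StringIO(csv_text))
print(students)
print(students.dtypes)  # column types inferred from the text
```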

# DataFrame information
```
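df.shape # returns the dimensions as (rows, columns)
df.shape[0] # returns the number of rows
df.info() # prints column names, non-null counts and data types
df.describe() # summary statistics for the numeric columns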
```
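
Applied to the four-student DataFrame declared earlier, these calls can be explored as in the sketch below. The DataFrame is repeated here only so that the example runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Andy", "Ben", "Cathy", "Debra"],
    "Age": [20, 22, 23, 22],
    "Sex": ["male", "male", "female", "female"],
    "Year": [1, 3, 4, 3]
})

rows, cols = df.shape          # unpack the (rows, columns) tuple
print(rows, cols)              # 4 4
print(df.columns.tolist())     # column names as a plain list

stats = df.describe()          # only numeric columns (Age, Year) are summarised
print(stats.loc['mean', 'Age'])  # 21.75
```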

# Retrieving a column (a.k.a. Pandas Series)
Use `[]` (square brackets) with a column name (as a string) to select the column you want to retrieve. The result is a pandas `Series`.

Example:
```
df['Name']
name = df['Name']
print(type(name)) # prints <class 'pandas.core.series.Series'>
```
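
Once a column has been pulled out as a `Series`, it offers its own methods. The sketch below shows a few common ones on the student data (repeated here so the example runs on its own).

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Andy", "Ben", "Cathy", "Debra"],
    "Age": [20, 22, 23, 22],
    "Sex": ["male", "male", "female", "female"],
    "Year": [1, 3, 4, 3]
})

age = df['Age']
print(age.mean())   # 21.75
print(age.max())    # 23

names = df['Name'].tolist()          # convert a Series to a plain Python list
sex_counts = df['Sex'].value_counts()  # how often each value occurs
print(names)
print(sex_counts)
```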

# Retrieving multiple columns
Use `[]` (square brackets) with a list of column names (as strings) to select the columns you want to retrieve. The result is a DataFrame.

**Example**
```
columns = ['Name', 'Year'] # declare the columns you want in a list
df[columns] # pass the column list to the DataFrame as the selector
df[['Name', 'Year']] # or simply put the list of column names directly
# Note the double square brackets: the outer pair indexes, the inner pair is the list
name_and_year = df[['Name', 'Year']]
print(type(name_and_year)) # prints <class 'pandas.core.frame.DataFrame'>
```
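**Below is a mistake**: without the inner list, pandas looks for a single column labeled `('Name', 'Year')` and raises a `KeyError`.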
```
name_and_year = df['Name', 'Year']
```
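
To see the failure without stopping a script, the mistaken call can be wrapped in `try`/`except`, as in this small sketch on the same student data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Andy", "Ben", "Cathy", "Debra"],
    "Age": [20, 22, 23, 22],
    "Sex": ["male", "male", "female", "female"],
    "Year": [1, 3, 4, 3]
})

try:
    df['Name', 'Year']        # single brackets: treated as ONE column label
    raised = False
except KeyError as err:
    raised = True
    print('KeyError:', err)

# The correct form passes a list of column names
name_and_year = df[['Name', 'Year']]
print(name_and_year.shape)    # (4, 2)
```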

# Use the `loc` indexer to locate a row
Pandas uses the `loc` attribute to return one or more rows by index label. Note that `loc` is indexed with square brackets, not called like a function.

Example:
```
df.loc[0]
df.loc[1]
type(df.loc[0]) # returns pandas.core.series.Series
```

# Locating Multiple Rows
Provide a list of row labels: `[0, 2]`

**Example**:
```
rows = [0, 2]
df.loc[rows]
```

**Or simplified one liner**:
```
df.loc[[0, 2]]
```

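**Mistake**: without the inner list, `loc` reads `0` as the row label and `2` as the column label, which raises a `KeyError` here: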
```
df.loc[0, 2]
```
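
The two-argument form of `loc` is not wrong in itself: it means "row label, column label". The sketch below contrasts the failing call with valid row-and-column lookups on the student data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Andy", "Ben", "Cathy", "Debra"],
    "Age": [20, 22, 23, 22],
    "Sex": ["male", "male", "female", "female"],
    "Year": [1, 3, 4, 3]
})

try:
    df.loc[0, 2]              # row label 0, column label 2 (no such column)
    raised = False
except KeyError:
    raised = True

first_name = df.loc[0, 'Name']          # one cell: row 0, column 'Name'
two_names = df.loc[[0, 2], 'Name']      # rows 0 and 2 of column 'Name'
print(first_name)                       # Andy
print(two_names.tolist())               # ['Andy', 'Cathy']
```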


## Retrieving a range of rows
- Provide the starting and ending row labels.
- Use `:` to separate them. With `loc`, the ending label is included.

Example:
```
df.loc[0:1] # rows labeled 0 and 1 (end label inclusive)
df.loc[:2] # from the first row through label 2 (inclusive)
df.loc[2:] # from label 2 to the end
```
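Shorthand (plain `[]` slicing works by position and excludes the end position, so it is not identical to `loc`):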
```
df[0:1] # position 0 only. End position exclusive.
df[:2] # from position 0 to 1. End position exclusive.
df[2:] # from position 2 to the end
```
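
The difference between label-based and position-based slicing is easy to check by counting rows. The sketch below also re-indexes the data by name (an illustrative choice) to show that `loc` slices work with non-numeric labels.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Andy", "Ben", "Cathy", "Debra"],
    "Age": [20, 22, 23, 22],
    "Sex": ["male", "male", "female", "female"],
    "Year": [1, 3, 4, 3]
})

n_loc = len(df.loc[0:1])     # label slice, end included
n_plain = len(df[0:1])       # positional slice, end excluded
n_iloc = len(df.iloc[0:1])   # iloc is the explicit positional indexer
print(n_loc, n_plain, n_iloc)  # 2 1 1

# With names as the index, loc slices by label and still includes the end
by_name = df.set_index('Name')
ben_to_cathy = by_name.loc['Ben':'Cathy']
print(ben_to_cathy.index.tolist())  # ['Ben', 'Cathy']
```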


# Filtering Rows
- A comparison on a column is evaluated for every row.
- It produces a boolean `Series` holding `True` or `False` for each row.
- Indexing the DataFrame with that `Series` keeps only the rows marked `True`.

Example:
```
age_filter = df['Age']<23 # boolean Series; the name avoids shadowing the built-in filter()
print(age_filter)
df[age_filter]
```
Or in one line:
```
df[df['Age']<23]
```
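
To make the intermediate boolean values visible, the sketch below prints the mask for the student data before applying it.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Andy", "Ben", "Cathy", "Debra"],
    "Age": [20, 22, 23, 22],
    "Sex": ["male", "male", "female", "female"],
    "Year": [1, 3, 4, 3]
})

age_filter = df['Age'] < 23
print(age_filter.tolist())   # [True, True, False, True]
print(age_filter.sum())      # True counts as 1, so this is the number of matches: 3

young = df[age_filter]       # keeps only rows where the mask is True
print(young['Name'].tolist())  # ['Andy', 'Ben', 'Debra']
```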


# Filtering Rows with multiple filters
Example:
```
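age_filter = df['Age']<23
year_filter = df['Year']==3
df[age_filter & year_filter] # rows where both conditions hold
df[age_filter | year_filter] # rows where at least one condition holds
```
Concatenate the filters and their combination side by side to see how each row is evaluated:
```
pd.concat([age_filter, year_filter, (age_filter & year_filter)], axis=1)
pd.concat([age_filter, year_filter, (age_filter | year_filter)], axis=1)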
```
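
When the conditions are written inline instead of being stored in variables, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `<` and `==`. A minimal sketch on the student data, which also shows `~` for negation:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Andy", "Ben", "Cathy", "Debra"],
    "Age": [20, 22, 23, 22],
    "Sex": ["male", "male", "female", "female"],
    "Year": [1, 3, 4, 3]
})

# AND: both conditions must hold
both = df[(df['Age'] < 23) & (df['Year'] == 3)]
# OR: at least one condition must hold
either = df[(df['Age'] < 23) | (df['Year'] == 3)]
# NOT: ~ inverts a boolean Series
neither = df[~((df['Age'] < 23) | (df['Year'] == 3))]

print(both['Name'].tolist())     # ['Ben', 'Debra']
print(either['Name'].tolist())   # ['Andy', 'Ben', 'Debra']
print(neither['Name'].tolist())  # ['Cathy']
```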

# Restricting rows and columns at the same time
You can apply row filtering and column selection at the same time.

Example:
```
df[df['Year']==3]
df[df['Year']==3][['Name', 'Year']]
```
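
An alternative worth knowing is `loc`, which accepts a row filter and a column list in a single step. The sketch below checks that it gives the same result as the chained form on the student data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Andy", "Ben", "Cathy", "Debra"],
    "Age": [20, 22, 23, 22],
    "Sex": ["male", "male", "female", "female"],
    "Year": [1, 3, 4, 3]
})

# Two steps: filter rows, then select columns
chained = df[df['Year'] == 3][['Name', 'Year']]

# One step: loc[row_filter, column_list]
single = df.loc[df['Year'] == 3, ['Name', 'Year']]

print(single)
print(single.equals(chained))  # True
```

The single-step form is generally the safer choice when the selection will later be assigned to, since it avoids chained indexing.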

# Use Pandas to retrieve HTML content
Example (`pd.read_html()` needs an HTML parser such as `lxml` to be installed):
```
url = 'http://www.multpl.com/s-p-500-dividend-yield/table?f=m'
raw_html_tbl = pd.read_html(url)
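type(raw_html_tbl) # a list of DataFrames, one per table found on the page
len(raw_html_tbl) # number of tables found
raw_html_tbl[0] # the first table
type(raw_html_tbl[0]) # pandas.core.frame.DataFrame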
```

# `pd.read_html()` doesn't always work
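- Many pages contain malformed HTML that the parser cannot handle.
- Some HTML is generated by JavaScript on the fly, so the table is not in the page source that `pd.read_html()` downloads.

**Example**:

The following scraping attempt won't work: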
```
url2 = 'https://stock360.hkej.com/marketWatch/Top20'
raw_html_tbl2 = pd.read_html(url2)
raw_html_tbl2
```

# Another example of web scraping using Pandas
Retrieving the currency table from Yahoo Finance:
```
url3 = 'https://hk.finance.yahoo.com/currencies'
raw_html_tbl3 = pd.read_html(url3)
raw_html_tbl3[0]
```

# Reading CSV
Code example:
```
df = pd.read_csv('./data/Salaries.csv')
df.head() # returns the first 5 rows
df.tail() # returns the last 5 rows
df.head(10) # returns the first 10 rows
df.tail(10) # returns the last 10 rows
df.info() # the table metadata
df.describe() # basic statistical information
```
**Handling Empty Values**:
- The `info()` method also tells us how many non-null values are present in each column.
- Empty (null) values can distort an analysis, so consider removing rows that contain them.
- This is called data cleaning.
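
Besides reading the `info()` output, missing values can be counted directly with `isna()`. The small table below is made up for illustration; only the `BasePay` and `Notes` column names are borrowed from the salary data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values (NaN)
pay = pd.DataFrame({
    'Employee': ['A', 'B', 'C', 'D'],
    'BasePay': [1000.0, np.nan, 1500.0, np.nan],
    'Notes': [np.nan, np.nan, np.nan, np.nan],
})

missing_counts = pay.isna().sum()   # number of missing values per column
missing_share = pay.isna().mean()   # fraction of missing values per column
print(missing_counts)
print(missing_share)
```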

# Cleaning Data
Data cleaning means fixing bad data in your data set.

**Bad data could be**:
- Empty cells
- Data in wrong format
- Wrong data
- Duplicates

# Dropping Columns
You may want to drop columns that you don't need. `drop()` returns a new DataFrame and leaves the original unchanged:
```
new_df = df.drop(columns=['Notes', 'Status'])
df.info()
new_df.info()
```

# Dropping Rows with Empty Values
```
cleaned_df = new_df.dropna()
print(new_df.shape[0])
print(cleaned_df.shape[0])
cleaned_df.info()
cleaned_df.describe()
```
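
By default `dropna()` removes a row if any column is empty, which can discard more data than intended. The sketch below, on made-up data with a hypothetical `Benefits` column, compares that default with restricting the check to one column through the `subset` parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical data: B lacks BasePay, C lacks Benefits
pay = pd.DataFrame({
    'Employee': ['A', 'B', 'C'],
    'BasePay': [1000.0, np.nan, 1500.0],
    'Benefits': [200.0, 300.0, np.nan],
})

any_missing = pay.dropna()                     # drop rows with a NaN anywhere
base_only = pay.dropna(subset=['BasePay'])     # drop rows only if BasePay is NaN

print(any_missing['Employee'].tolist())   # ['A']
print(base_only['Employee'].tolist())     # ['A', 'C']
print(len(pay))                           # the original is unchanged: 3
```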

# Replacing empty cells with mean or median

```
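df["BasePay"].describe()
x = df["BasePay"].mean() # or df["BasePay"].median()
df["BasePay"] = df["BasePay"].fillna(x) # assign back: an inplace fillna on a selected column may not update df in newer pandas
df["BasePay"].describe()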
```
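
Whether the mean or the median is the better filler depends on the data: a single extreme value pulls the mean much more than the median. The numbers below are made up to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical pay values with one missing entry and one outlier (9000)
pay = pd.DataFrame({'BasePay': [1000.0, 1200.0, np.nan, 9000.0]})

mean_value = pay['BasePay'].mean()      # NaN is skipped: (1000 + 1200 + 9000) / 3
median_value = pay['BasePay'].median()  # middle of 1000, 1200, 9000

filled_mean = pay['BasePay'].fillna(mean_value)
filled_median = pay['BasePay'].fillna(median_value)

print(filled_mean.tolist())
print(filled_median.tolist())  # [1000.0, 1200.0, 1200.0, 9000.0]
```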


# Finding Relationships
The `corr()` method calculates the correlation between each pair of numeric columns in your data set.
```
df.corr(numeric_only=True) # skip non-numeric columns (parameter available in pandas 1.5+)
```
The result of the `corr()` method is a table of numbers that shows how strongly each pair of columns is linearly related.
- The number varies from `-1` to `1`.
- `1` means a perfect positive correlation: the two columns increase together exactly in step.
- `0.9` is also a strong relationship: if you increase one value, the other will probably increase as well.
- `-0.9` is just as strong as `0.9`, but if you increase one value, the other will probably go down.
- `0.2` is a weak relationship: one value going up tells you little about whether the other will.

Example:
```
df = pd.read_excel('./data/Students.xls', sheet_name=0) # reading .xls files requires the xlrd package
df
df.corr(numeric_only=True) # skip non-numeric columns
```
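
The extremes of the scale are easiest to understand on data built to produce them. In the made-up table below, `score` rises in step with `hours` and `errors` falls in step with it. The text column is dropped with `select_dtypes` before calling `corr()`, which is one way to restrict the calculation to numeric columns.

```python
import pandas as pd

# Hypothetical data constructed to give perfect correlations
data = pd.DataFrame({
    'hours':  [1, 2, 3, 4, 5],
    'score':  [52, 54, 56, 58, 60],   # increases with hours
    'errors': [10, 8, 6, 4, 2],       # decreases with hours
    'label':  list('abcde'),          # text column, not usable for correlation
})

numeric = data.select_dtypes(include='number')  # keep numeric columns only
corr = numeric.corr()
print(corr)

hours_score = corr.loc['hours', 'score']     # close to 1
hours_errors = corr.loc['hours', 'errors']   # close to -1
```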

# Groupby: divide data into groups
```
df = pd.read_csv('./data/Salaries.csv')
df.info()
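df.groupby('JobTitle') # returns a DataFrameGroupBy object; nothing is computed yet
df.groupby('Year')
group_by_year = df.groupby('Year')
group_by_year.groups # maps each year to the row labels in that group
group_by_year.get_group(2011) # DataFrame with only the 2011 rows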
```
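
Since the salary file is not included here, the sketch below uses a tiny made-up table with the same `JobTitle`, `Year` and `BasePay` column names to show what a grouped object contains.

```python
import pandas as pd

# Hypothetical rows mimicking the salary data
pay = pd.DataFrame({
    'JobTitle': ['Clerk', 'Nurse', 'Clerk', 'Nurse'],
    'Year': [2011, 2011, 2012, 2012],
    'BasePay': [1000, 2000, 1100, 2300],
})

by_year = pay.groupby('Year')
print(by_year.ngroups)   # number of distinct groups: 2
print(by_year.size())    # number of rows in each group

# Iterating yields (group key, sub-DataFrame) pairs
for year, group in by_year:
    print(year, group['JobTitle'].tolist())

rows_2011 = by_year.get_group(2011)
print(rows_2011)
```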

# Group Aggregated Data
An aggregate function returns a single value for each group.

Example:
```
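base_pay_by_year = group_by_year['BasePay'] # pick a numeric column: averaging text columns such as JobTitle fails in pandas 2.x
base_pay_by_year.agg('mean') # string names are preferred over np.mean in recent pandas
base_pay_by_year.agg('max')
base_pay_by_year.agg('min')
base_pay_by_year.agg(['sum', 'mean', 'std']) # several aggregates at once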
```
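
When several aggregates are needed, named aggregation gives each result column a readable name. A minimal sketch on made-up salary rows (the output column names `mean_pay`, `max_pay` and `headcount` are chosen for this example):

```python
import pandas as pd

# Hypothetical rows mimicking the salary data
pay = pd.DataFrame({
    'JobTitle': ['Clerk', 'Nurse', 'Clerk', 'Nurse'],
    'Year': [2011, 2011, 2012, 2012],
    'BasePay': [1000, 2000, 1100, 2300],
})

# Each keyword is an output column: name=(input column, aggregate function)
summary = pay.groupby('Year').agg(
    mean_pay=('BasePay', 'mean'),
    max_pay=('BasePay', 'max'),
    headcount=('BasePay', 'count'),
)
print(summary)
```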

# Group by with multiple columns

Example:
```
df.groupby(['JobTitle', 'Year']) # one group per (JobTitle, Year) pair
df.groupby(['JobTitle', 'Year']).groups
df.groupby(['JobTitle', 'Year'])['BasePay'].agg('mean') # mean base pay per (JobTitle, Year) pair
```
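
Grouping by two columns produces a result indexed by both keys (a `MultiIndex`). The sketch below, on made-up rows, shows how to look up one pair and how to flatten the result back into ordinary columns.

```python
import pandas as pd

# Hypothetical rows mimicking the salary data
pay = pd.DataFrame({
    'JobTitle': ['Clerk', 'Clerk', 'Clerk', 'Nurse', 'Nurse', 'Nurse'],
    'Year': [2011, 2011, 2012, 2011, 2012, 2012],
    'BasePay': [1000, 1200, 1100, 2000, 2300, 2500],
})

mean_pay = pay.groupby(['JobTitle', 'Year'])['BasePay'].mean()
print(mean_pay)

# Look up a single (JobTitle, Year) pair with a tuple
clerk_2011 = mean_pay.loc[('Clerk', 2011)]
print(clerk_2011)   # 1100.0

# reset_index turns the group keys back into regular columns
flat = mean_pay.reset_index()
print(flat)
```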



# Merging Data
- **Concatenation**: stacking Series or DataFrames together, along rows or columns
- **Joining**: combining DataFrames on shared key columns, much like a join in a relational database (SQL)

**Concatenation**:
```
df1 = pd.read_excel('./data/Students.xls', sheet_name=0)
df2 = pd.read_excel('./data/Students.xls', sheet_name=0)
concatenated1 = pd.concat([df1, df2]) # stack rows (axis=0 is the default)
concatenated1
concatenated2 = pd.concat([df1, df2], axis=1) # place side by side as extra columns
concatenated2
```
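
One detail to watch when stacking rows is that each input keeps its own row labels, so labels repeat in the result. The sketch below uses two small made-up tables and shows `ignore_index=True` as one way to renumber the rows.

```python
import pandas as pd

# Hypothetical tables; the values are made up for illustration
a = pd.DataFrame({'Academic year': ['2019', '2020'], 'Under-graduate': [100, 110]})
b = pd.DataFrame({'Academic year': ['2021', '2022'], 'Under-graduate': [120, 130]})

stacked = pd.concat([a, b])                         # row labels repeat
renumbered = pd.concat([a, b], ignore_index=True)   # fresh labels 0..n-1

print(stacked.index.tolist())      # [0, 1, 0, 1]
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```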


**Joining**:
```
left = pd.read_excel('./data/Students.xls', sheet_name=0)
right = pd.read_excel('./data/Students.xls', sheet_name=1)
joined = pd.merge(left, right, on='Academic year') # inner join (the default) on the shared column
joined
```
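
As in SQL, the kind of join decides what happens to keys that appear on only one side. The sketch below uses made-up tables (the `Post-graduate` column is hypothetical) and the `how` parameter of `pd.merge` to compare inner, left and outer joins.

```python
import pandas as pd

# Hypothetical tables that share the 'Academic year' key only partially
left = pd.DataFrame({'Academic year': ['2019', '2020', '2021'],
                     'Under-graduate': [100, 110, 120]})
right = pd.DataFrame({'Academic year': ['2020', '2021', '2022'],
                      'Post-graduate': [40, 45, 50]})

inner = pd.merge(left, right, on='Academic year')                  # keys in both tables
left_join = pd.merge(left, right, on='Academic year', how='left')  # every key from left
outer = pd.merge(left, right, on='Academic year', how='outer')     # every key from either

print(len(inner), len(left_join), len(outer))   # 2 3 4
print(left_join)   # 2019 has no match, so Post-graduate is NaN there
```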

# Matplotlib
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.

**Import before you use**:
```
import matplotlib.pyplot as plt
```

**Samples**:
```
df = pd.read_excel('./data/Students.xls', sheet_name=0)
year = df['Academic year']
ug = df['Under-graduate']
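plt.plot(year, ug) # line plot
plt.plot(year, ug, 'o') # circle markers only, no connecting line
plt.show() # needed when running as a script; notebooks usually display the plot automatically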
```

# Styling the Markers

Other markers to consider:

- `'o'`: Circle
- `'*'`: Star
- `'.'`: Point
- `','`: Pixel
- `'x'`: X
- `'X'`: X (filled)
- `'+'`: Plus
- `'P'`: Plus (filled)
- `'s'`: Square
- `'D'`: Diamond
- `'d'`: Diamond (thin)
- `'p'`: Pentagon
- `'H'`: Hexagon
- `'h'`: Hexagon
- `'v'`: Triangle Down
- `'^'`: Triangle Up
- `'<'`: Triangle Left
- `'>'`: Triangle Right
- `'1'`: Tri Down
- `'2'`: Tri Up
- `'3'`: Tri Left
- `'4'`: Tri Right
- `'|'`: Vline
- `'_'`: Hline

# Configuring marker size
You can use the keyword argument `markersize`, or its shorter alias `ms`, to set the size of the markers.

Example:
```
plt.plot(year, ug, 'D', ms=10)
```
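
Putting the marker options together, the sketch below draws a small chart and writes it to an in-memory PNG. The numbers are made up, and the `Agg` backend is selected only so that the example runs without a display; neither is part of the course data.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Hypothetical values standing in for the student data
year = [2018, 2019, 2020, 2021]
ug = [100, 110, 105, 120]

fig, ax = plt.subplots()
# Diamond markers of size 10, joined by a dashed line
(line,) = ax.plot(year, ug, marker='D', ms=10, linestyle='--')
ax.set_xlabel('Academic year')
ax.set_ylabel('Under-graduate')

# Save to an in-memory buffer; pass a file name instead to write to disk
buffer = io.BytesIO()
fig.savefig(buffer, format='png')
plt.close(fig)
```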