# Important Python Libraries
- NumPy: Stands for Numerical Python and is used for numerical analysis, including linear algebra, Fourier transforms, and other analyses.
- Pandas: Stands for Panel Data. It is used for structured data operations and manipulations. It is the most common library for data analytics.
- SciPy: Stands for Scientific Python. The `stats` methods in SciPy are useful for classical statistical tests.
- Matplotlib: Stands for Matlab Plotting Library. It can create basic 2D and 3D graphs.


# Loading data into Python

## From a CSV file
`import pandas as pd
df = pd.read_csv("<Path to Data>.csv")`

Note that the path to the data must be in quotation marks and is relative to wherever your notebook is running.

### From a TXT file
Sometimes data is separated by tabs rather than commas. For such data, use the `sep` parameter to specify the delimiter, here a tab. This also works for data separated by other characters, such as a pipe (`|`).

`import pandas as pd
df = pd.read_csv("<Path to Data>.txt", sep = '\t')`
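
As a minimal sketch of the `sep` parameter, the example below reads tab-separated and pipe-separated text. The data is made up, and `io.StringIO` stands in for a real file path so the snippet runs without any files on disk.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
tab_text = "name\tscore\nAna\t90\nBen\t85\n"
pipe_text = "name|score\nAna|90\nBen|85\n"

# io.StringIO lets read_csv treat a string like an open file.
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")
df_pipe = pd.read_csv(io.StringIO(pipe_text), sep="|")

print(df_tab)
# Both delimiters produce the same dataframe.
print(df_tab.equals(df_pipe))
```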

## From an Excel file

`import pandas as pd
df = pd.read_excel("<Path to Data>", sheet_name = "Worksheet_Name")`

Note that if you do not provide the worksheet name, pandas reads the first worksheet in the file. This can give you strange results if the first sheet is not your data, for example if it holds the output of a regression. As with CSVs, the path to the data must be in quotation marks.

# Exploring a dataframe

- Look at the top 5 records: `df.head()`
- Look at the bottom 5 records: `df.tail()`
- View column names: `df.columns`

## Rename columns
This passes a dictionary to rename from an old column name to a new column name.
`df_new_names = df.rename(columns = {"old_name": "new_name"})`

The parameter `inplace = True` will save the change on the original dataframe.
`df.rename(columns = {"old_name": "new_name"}, inplace = True)`
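
A short sketch of renaming on a made-up dataframe (the column names are hypothetical). It shows that `rename` returns a new dataframe and leaves the original untouched unless `inplace = True` is used.

```python
import pandas as pd

# Hypothetical two-column dataframe.
df = pd.DataFrame({"old_name": [1, 2], "other": [3, 4]})

# rename returns a new dataframe; df itself keeps its column names.
df_new_names = df.rename(columns={"old_name": "new_name"})

print(list(df.columns))
print(list(df_new_names.columns))
```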


# Selecting columns or rows

- Subsetting dataframe: `df[["column1", "column2"]]`
- Filtering records based on one column: `df[df["column1"]>10]`
- Two conditions that both must be true (each condition needs its own parentheses): `df[(df["column1"]>10) & (df["column2"]<40)]`
- Two conditions where only one must be true: `df[(df["column1"]>10) | (df["column2"]<40)]`
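
The sketch below filters a small made-up dataframe on two conditions. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `>` and `<` in Python; without them the expression is evaluated in the wrong order.

```python
import pandas as pd

# Hypothetical data: four rows, two numeric columns.
df = pd.DataFrame({"column1": [5, 12, 20, 8], "column2": [50, 30, 45, 10]})

# Both conditions must hold.
both = df[(df["column1"] > 10) & (df["column2"] < 40)]

# At least one condition must hold.
either = df[(df["column1"] > 10) | (df["column2"] < 40)]

print(both)
print(either)
```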

# Convert data types

To see the data type of every column in a dataframe use:
    
`df.dtypes`

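## Convert numeric variables to strings
To cast a single value to a specific data type use:

`string_output = str(numeric_input)
integer_output = int(string_input)
float_output = float(string_input)`

These built-in functions work on single values, not on whole columns. To convert a dataframe column, use `astype`, for example:
`df["value_as_int"] = df["value_as_string"].astype(int)`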
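
A minimal sketch of column-level conversion, assuming a made-up column of numbers stored as strings. It uses `astype`, which converts a whole column at once; the built-in `int()` would raise an error if given a whole column.

```python
import pandas as pd

# Hypothetical column of numbers stored as text.
df = pd.DataFrame({"value_as_string": ["1", "2", "30"]})

# astype converts every value in the column.
df["value_as_int"] = df["value_as_string"].astype(int)
df["value_as_float"] = df["value_as_string"].astype(float)
df["back_to_string"] = df["value_as_int"].astype(str)

print(df.dtypes)
```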

## Convert Date to datetime format
`df["date_as_datetime"] = pd.to_datetime(df["date_as_string"])`
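
Once a column is in datetime format, its parts can be read through the `.dt` accessor. The sketch below assumes made-up ISO-formatted date strings; other formats may need the `format` argument of `pd.to_datetime`.

```python
import pandas as pd

# Hypothetical dates stored as text.
df = pd.DataFrame({"date_as_string": ["2021-01-15", "2021-02-28"]})

df["date_as_datetime"] = pd.to_datetime(df["date_as_string"])

# The .dt accessor exposes date parts such as year and month.
df["year"] = df["date_as_datetime"].dt.year
df["month"] = df["date_as_datetime"].dt.month

print(df)
```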

# Reshaping data

## Transpose data

Changing data from long to wide format:

`df = pd.read_csv("<Path to data>.csv")
df_pivot = df.pivot(index = "<index variable>", columns = "<Column to break up and convert to wide format>", values = "<whatever value you are trying to see>")`
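
To make the long-to-wide idea concrete, here is a sketch on a made-up sales table (the `store`, `quarter`, and `sales` names are hypothetical). Each distinct value of `quarter` becomes its own column.

```python
import pandas as pd

# Hypothetical long-format data: one row per store and quarter.
df_long = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [10, 20, 30, 40],
})

# One row per store, one column per quarter.
df_wide = df_long.pivot(index="store", columns="quarter", values="sales")

print(df_wide)
```

Note that `pivot` raises an error if an index/column pair appears more than once; `pivot_table` handles that case by aggregating.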

## Sorting a Dataframe

`ascending` tells pandas whether to sort smallest to largest. If you want largest to smallest, change the value from `True` to `False`.

`df = pd.read_csv("<Path to data>.csv")
df_sorted = df.sort_values(["<first column to sort>", "<second column to sort>"], ascending = [True, True])`
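
A small sketch of sorting on two columns with `sort_values`, using made-up data. The first column is sorted smallest to largest and the second largest to smallest within each group.

```python
import pandas as pd

# Hypothetical data with a grouping column and a value column.
df = pd.DataFrame({"group": ["b", "a", "b", "a"], "value": [1, 4, 3, 2]})

# group ascending, then value descending within each group.
df_sorted = df.sort_values(["group", "value"], ascending=[True, False])

print(df_sorted)
```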

# Basic plots using Matplotlib

These are basic graphs. You can make prettier graphs using other libraries like Seaborn or Plotly, but Matplotlib is the basis for many Python plotting libraries.

## Histogram

Simple hist using pandas: `df["<column>"].hist()`

More advanced hist where you can edit more of the plot features:

`import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("data/file.csv")
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
#Variable
ax.hist(df['variable to plot'], bins = 5)
plt.title("<Title>")
plt.xlabel("<Label for x axis>")
plt.ylabel("<Label for y axis>")
plt.show()`

## Boxplot

Simple boxplot using pandas: `df.boxplot(column = "<column>")`

`fig = plt.figure()
ax = fig.add_subplot(1,1,1)
#Variable
ax.boxplot(df['X variable'])
plt.title("<Title>")
plt.xlabel("<Label for x axis>")
plt.ylabel("<Label for y axis>")
plt.show()`

## Scatterplot

A basic scatterplot:

`fig = plt.figure()
ax = fig.add_subplot(1,1,1)
#Variable
ax.scatter(df['X variable'], df['Y variable'])
plt.title("<Title>")
plt.xlabel("<Label for x axis>")
plt.ylabel("<Label for y axis>")
plt.show()`



# Group and summarize data


## Groupby
Groupby helps with three operations:

1. Split the data into groups
2. Apply a function to each group
3. Combine the results into a data structure


- `df_var1 = df.groupby(["Var 1"])`
- `df_var1_mean = df.groupby(["Var 1"]).mean()`
- `df_count = df.groupby(["Var 1", "Var 2"]).count()`

You can group by `Var 1` and calculate the sum of another column using:
`df_var2 = df.groupby(["Var 1"])["Var 2"].sum()`
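
The split, apply, and combine steps can be sketched on a small made-up dataframe. The column names follow the examples above; the values are hypothetical.

```python
import pandas as pd

# Hypothetical data: "Var 1" is the grouping key.
df = pd.DataFrame({
    "Var 1": ["x", "x", "y", "y"],
    "Var 2": [1, 2, 3, 4],
    "Var 3": [10.0, 20.0, 30.0, 40.0],
})

# Mean of every remaining column within each group.
df_var1_mean = df.groupby(["Var 1"]).mean()

# Sum of a single column within each group (returns a Series).
df_var2 = df.groupby(["Var 1"])["Var 2"].sum()

print(df_var1_mean)
print(df_var2)
```

If the dataframe also holds text columns, recent pandas versions may need `mean(numeric_only=True)` or an explicit column selection before aggregating.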

## Pivot_table

`pivot_table` builds a spreadsheet-style summary table. It has three parts: index (row groups), columns (column groups), and values (the data to aggregate).

- `pd.pivot_table(df, values = 'Col 1', index = ['Col 2', 'Col 3'], columns = ['Col 4', 'Col 5'])`
- with an explicit aggregation function (the default is the mean): `pd.pivot_table(df, values = 'Col 1', index = ['Col 2', 'Col 3'], columns = ['Col 4', 'Col 5'], aggfunc = "mean")`
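
A minimal sketch comparing the default aggregation with an explicit one, on made-up sales data (`region`, `product`, and `sales` are hypothetical names). The aggregation function is passed as a string.

```python
import pandas as pd

# Hypothetical data: region N has two rows for product a.
df = pd.DataFrame({
    "region": ["N", "N", "S", "S", "N"],
    "product": ["a", "b", "a", "b", "a"],
    "sales": [1, 2, 3, 4, 5],
})

# Default aggregation: the mean of sales per region and product.
table_mean = pd.pivot_table(df, values="sales", index="region", columns="product")

# Explicit aggregation: the sum instead of the mean.
table_sum = pd.pivot_table(
    df, values="sales", index="region", columns="product", aggfunc="sum"
)

print(table_mean)
print(table_sum)
```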

## Cross tab
`crosstab` computes a simple frequency table of two factors, counting how often each combination of values occurs.
`pd.crosstab(df['Col 1'], df['Col 2'])`
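
For illustration, the sketch below counts combinations of two made-up categorical columns.

```python
import pandas as pd

# Hypothetical categorical data.
df = pd.DataFrame({
    "Col 1": ["a", "a", "b", "b", "b"],
    "Col 2": ["x", "y", "x", "x", "y"],
})

# Rows are values of Col 1, columns are values of Col 2, cells are counts.
counts = pd.crosstab(df["Col 1"], df["Col 2"])

print(counts)
```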

# Cleaning data

## Deduplicating data

Remove rows that are duplicated on the listed columns (the first occurrence is kept):
    
`df_dedup = df.drop_duplicates(["Var 1", "Var 2"])`
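
A short sketch on made-up data showing that only the listed columns are compared, so rows that differ in other columns can still be treated as duplicates.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 match on "Var 1" and "Var 2" only.
df = pd.DataFrame({
    "Var 1": [1, 1, 2, 2],
    "Var 2": ["a", "a", "b", "c"],
    "Var 3": [10, 11, 12, 13],
})

# By default the first row of each duplicate set is kept.
df_dedup = df.drop_duplicates(["Var 1", "Var 2"])

print(df_dedup)
```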

## Dealing with missing data

- Find missing values: `df.isnull()`
- Drop `NaN` values: `df.dropna()`

### Impute missing values using column mean
This will fill all the missing values in a column with the average of that column.
`import pandas as pd
meanVar1 = df["Var1"].mean()
df["Var1"] = df["Var1"].fillna(meanVar1)`

### Fill missing value with a static number

Replace missing values with 5: `df["Var1"] = df["Var1"].fillna(5)`
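
The options above can be compared side by side on a small made-up column with one missing value. `np.nan` is used here to create the gap.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({"Var1": [1.0, np.nan, 3.0]})

# Count the missing values in the column.
n_missing = df["Var1"].isnull().sum()

# Three ways to handle them: drop, fill with the mean, fill with a constant.
dropped = df.dropna()
filled_mean = df["Var1"].fillna(df["Var1"].mean())
filled_static = df["Var1"].fillna(5)

print(n_missing)
print(filled_mean.tolist())
print(filled_static.tolist())
```

The mean is computed from the non-missing values only, since pandas skips `NaN` by default.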

# Merge two dataframes

## Concatenate
Concatenate two or more dataframes. By default the dataframes are stacked on top of each other (rows are appended) and aligned on their column names.

`pd.concat([df1, df2, df3])`
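
A minimal sketch with two made-up dataframes that share the same columns. `ignore_index=True` renumbers the rows, and `axis=1` places the dataframes side by side instead of stacking them.

```python
import pandas as pd

# Hypothetical dataframes with identical columns.
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "b": [7, 8]})

# Stack rows; ignore_index gives the result a fresh 0..n-1 index.
stacked = pd.concat([df1, df2], ignore_index=True)

# Place side by side, aligning on the row index.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side)
```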

## Merge

This merges (aka joins) df1 and df2 on their indexes. `how = 'inner'` means that only rows with a match in both datasets will be kept.
`df_new = pd.merge(df1, df2, how = 'inner', left_index = True, right_index = True)`
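
To show what `how` changes, the sketch below merges two made-up dataframes whose indexes only partly overlap. The index labels and column names are hypothetical.

```python
import pandas as pd

# Hypothetical dataframes: index labels "b" and "c" appear in both.
df1 = pd.DataFrame({"left_val": [1, 2, 3]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"right_val": [10, 20, 30]}, index=["b", "c", "d"])

# inner: keep only index labels found in both dataframes.
df_inner = pd.merge(df1, df2, how="inner", left_index=True, right_index=True)

# outer: keep every index label; unmatched cells become NaN.
df_outer = pd.merge(df1, df2, how="outer", left_index=True, right_index=True)

print(df_inner)
print(df_outer)
```

To join on a shared column rather than the index, `pd.merge` also accepts an `on` argument.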



# Creating new columns

- `Col 2` is the `Col 1` plus a value (X): `df['Col 2'] = df['Col 1'] + X`
- `Col 2` is sum of `Col 1` and `Col 0`: `df['Col 2'] = df['Col 1'] + df['Col 0']`
- `Map` a new function to a column. In this case, adding 5 to each value: `df['Col_new'] = df['Col 1'].map(lambda x: 5 + x)`
- `Apply` a function to a column or columns. In this case, adding three columns row by row: `df['Col_new'] = df[['Col 1', 'Col 2', 'Col 3']].apply(sum, axis = 1)`
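
The patterns above can be tried on a small made-up dataframe. `X` is a hypothetical constant, and the row-wise total uses `sum(axis=1)`, a built-in alternative to `apply` for adding columns together.

```python
import pandas as pd

# Hypothetical starting data.
df = pd.DataFrame({"Col 0": [1, 2], "Col 1": [10, 20]})
X = 5  # hypothetical constant

df["Col 2"] = df["Col 1"] + X                      # column plus a value
df["Col 3"] = df["Col 1"] + df["Col 0"]            # sum of two columns
df["Col_new"] = df["Col 1"].map(lambda x: 5 + x)   # function applied per value

# Row-wise total across selected columns.
df["Row_total"] = df[["Col 0", "Col 1"]].sum(axis=1)

print(df)
```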

# Basic statistics

## Describe

- summarize data: `df.describe()`

## Other things

- mean: `df.mean()`
- standard deviation: `df.std()`
- covariance: `df.cov()`
- correlation: `df.corr()`
- find unique values in a column: `df["<column>"].unique()`
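
A brief sketch of these statistics on made-up data. The numeric columns are selected first, as an assumption to keep text columns out of the calculations, and `unique` is called on a single column.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "label": ["a", "b", "a", "b"],
})

numeric = df[["x", "y"]]

means = numeric.mean()   # mean of each column
stds = numeric.std()     # sample standard deviation of each column
corr = numeric.corr()    # pairwise correlation matrix

# unique works on a single column (a Series).
unique_labels = df["label"].unique()

print(df.describe())
print(corr)
print(unique_labels)
```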
