# Kevin's Pandas Crib Sheet


This is a consolidation of notes and examples from:
> Corey Schafer's Pandas videos [here](https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS) 

and 
> Hands on Data Analysis by xxx

Version 2.0W

## 0. Set-Up 

In [75]:
import pandas as pd
import numpy as np
import datetime as dt
import pprint

In [76]:
people = {
    'first': ['Corey', 'Jane', 'Janey', 'John', 'Jimmy'], 
    'last': ['Schafer', 'Doe', 'Doe', 'Doe', 'Doe'], 
    'email': ["CoreyMSchafer@gmail.com", 'JaneDoe@email.com', 'JaneyDoe@email.com','JohnDoe@email.com', 'JimmyDoe@email.com']
}
print(f'{people=}')
# print(people)
people2 = {
    'first': ['Tony', 'Steve'], 
    'last': ['Stark', 'Rogers'], 
    'email': ['IronMan@avenge.com', 'Cap@avenge.com']
}

people={'first': ['Corey', 'Jane', 'Janey', 'John', 'Jimmy'], 'last': ['Schafer', 'Doe', 'Doe', 'Doe', 'Doe'], 'email': ['CoreyMSchafer@gmail.com', 'JaneDoe@email.com', 'JaneyDoe@email.com', 'JohnDoe@email.com', 'JimmyDoe@email.com']}


## 1.  Making a dataframe

In [77]:
df =  pd.DataFrame(people)
df2 = pd.DataFrame(people2)
df

|   | first | last | email |
|---|---|---|---|
| 0 | Corey | Schafer | CoreyMSchafer@gmail.com |
| 1 | Jane | Doe | JaneDoe@email.com |
| 2 | Janey | Doe | JaneyDoe@email.com |
| 3 | John | Doe | JohnDoe@email.com |
| 4 | Jimmy | Doe | JimmyDoe@email.com |


## 2. Quick Overview of the Data 

In [78]:
# df.info()             # Overview of the dataframe
# df.columns            # List column names
df.describe()           # Quick summary of the frame, best for wide format.


|   | first | last | email |
|---|---|---|---|
| count | 5 | 5 | 5 |
| unique | 5 | 2 | 5 |
| top | Corey | Doe | CoreyMSchafer@gmail.com |
| freq | 1 | 4 | 1 |


## 3. Indexes


In [79]:
# Set a new index. Keep the change with `inplace=True`.
# Indexes don't have to be unique
df.set_index('email', inplace=True)     # Set a column to be the index
print(df.index)
df.reset_index(inplace=True)            # Reset to the default integer index (handy to 'save' a column that was used as an index)

Index(['CoreyMSchafer@gmail.com', 'JaneDoe@email.com', 'JaneyDoe@email.com',
       'JohnDoe@email.com', 'JimmyDoe@email.com'],
      dtype='object', name='email')


## 4. Accessing Data 

In [80]:
# df                                # Simple access
# df['email']                       # Access single column
# df[['last', 'email']]             # Access multiple columns by passing a list (hence the double brackets)

# df.iloc[[0, 1], 2]                # Access by integer position with .iloc.  Both .loc and .iloc take the row indexer first

# df.loc['CoreyMSchafer@gmail.com', 'last'] # Access by row label with .loc (only works while 'email' is the index)
# df.loc[[0, 1], ['email', 'last']] # Row labels plus selected columns (works with the default integer index)
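
The difference between position-based and label-based access is easier to see side by side. The following is a minimal sketch on a small made-up frame (the three rows are illustrative, cut down from the `people` data above), not a replacement for the cell above.

```python
import pandas as pd

# Small illustrative frame, cut down from the crib sheet's 'people' data
df = pd.DataFrame(
    {
        "first": ["Corey", "Jane", "John"],
        "last": ["Schafer", "Doe", "Doe"],
        "email": ["CoreyMSchafer@gmail.com", "JaneDoe@email.com", "JohnDoe@email.com"],
    }
)

# .iloc selects by integer position: rows 0 and 1, third column (email)
by_position = df.iloc[[0, 1], 2]

# .loc selects by label: with the default index the row labels are 0, 1, 2
by_label = df.loc[[0, 1], ["email", "last"]]

# After set_index the row labels are the email addresses
indexed = df.set_index("email")
last_name = indexed.loc["CoreyMSchafer@gmail.com", "last"]
print(last_name)
```

Both `.loc` and `.iloc` take the row selector first and the column selector second.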

## 5. Selecting Data


### 5.1 Filters 

Best to filter with a 2-part process:
1. Set filter 
2. Apply filter

_Avoid using `filter` as the variable name: it is a Python built-in function, and reusing the name shadows it._

In [81]:
filt = (df['last'] == 'Schafer') | (df['first'] == 'John')  # 1) Set filter.  An example of an 'or' (|) filter
df.loc[filt, 'email']                                       # 2) Apply filter or
# df.loc[~filt, 'email']                                    # 2) Apply inverse of filter

0    CoreyMSchafer@gmail.com
3          JohnDoe@email.com
Name: email, dtype: object
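
Filters can also be combined with "and" (`&`), inverted with `~`, or built from a list of allowed values with `.isin`. The sketch below uses a small frame modelled on the `people` data; the variable names (`is_doe`, `starts_with_ja`) are just illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "first": ["Corey", "Jane", "Janey", "John", "Jimmy"],
        "last": ["Schafer", "Doe", "Doe", "Doe", "Doe"],
    }
)

# 1) Set the filters: each one is a boolean Series aligned with df
is_doe = df["last"] == "Doe"
starts_with_ja = df["first"].str.startswith("Ja")

# 2) Apply them: '&' keeps rows where both are True, '~' inverts a filter
both = df.loc[is_doe & starts_with_ja, "first"]
not_doe = df.loc[~is_doe, "first"]

# .isin builds a filter from a list of allowed values
in_list = df.loc[df["first"].isin(["Corey", "Jimmy"]), "last"]
print(both.tolist(), not_doe.tolist(), in_list.tolist())
```

Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `==`.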

## 6. Updating Values

### 6.1 Update Column Names


In [82]:
# df.columns = ['email', 'first_name', 'last_name']         # Rename all columns 

# df.rename(                                                # Rename specific columns using .rename
#     columns={
#         'first_name': 'first', 'last_name': 'last'
#         }, inplace=True                                   # Note, need "inplace" 
#     ) 
 
# df.columns = [x.upper() for x in df.columns]              # Rename all columns with a list comprehension over .columns

# Reset
df.columns = [x.lower() for x in df.columns]                # Reset so later examples work
df

|   | email | first | last |
|---|---|---|---|
| 0 | CoreyMSchafer@gmail.com | Corey | Schafer |
| 1 | JaneDoe@email.com | Jane | Doe |
| 2 | JaneyDoe@email.com | Janey | Doe |
| 3 | JohnDoe@email.com | John | Doe |
| 4 | JimmyDoe@email.com | Jimmy | Doe |


### 6.2 Update Values - Direct Updates

In [83]:
df['email'] = df['email'].str.lower()                               # Update a whole column with a string method via .str
df.loc[3] = ['John2Smith@email.com', 'John2', 'Smith']              # Update whole row with .loc
df.loc[2, ['last', 'email']] = ['Smith', 'janeysmith@email.com']    # Update specific columns of a row with .loc

# Update based on filter 
filt = (df['email'] == 'John2Smith@email.com')                      # Update cells based on a filter with .loc
# df[filt]['last'] = 'Smith'                                        # DON'T do this: chained indexing assigns to a temporary copy, so df is unchanged
df.loc[filt, 'first'] = 'Johnny'                                    # THIS works: a single .loc call assigns into df itself

df

|   | email | first | last |
|---|---|---|---|
| 0 | coreymschafer@gmail.com | Corey | Schafer |
| 1 | janedoe@email.com | Jane | Doe |
| 2 | janeysmith@email.com | Janey | Smith |
| 3 | John2Smith@email.com | Johnny | Smith |
| 4 | jimmydoe@email.com | Jimmy | Doe |


### 6.3 Updating Values - with Functions 

Four functions:
- `apply`
- `map`
- `applymap` (renamed to `DataFrame.map` in pandas 2.1, where `applymap` is deprecated)
- `replace`

#### 6.3.1 `apply` a function to an object (dataframe or series) and get a series as a result
- Object can be a series (by default a column) 
- Object can be a dataframe in which case it's applied to each series (column) for a single result for each


In [84]:
# Applying to a column
# df['email'].apply(len)            # `apply` the `len` function to the email column

# def update_email(email):          # 'apply' your own function
#     return email.upper()
# df['email'].apply(update_email) 

# df['email'].apply(                # `apply` your own inline (lambda) function 
#     lambda x: x.lower()           # to a whole column and get a series as a result
#     )  

# When applied to a dataframe, `apply` is called once per series
df.apply(len)                       # Default axis='index' (alias 'rows'): one result per column. axis='columns' gives one result per row instead
# df.apply(pd.Series.min)           # Returns the minimum (alphabetically first) value in each column

# df.apply(                           # Applying a Lambda function to each series
#     lambda x: x.min()
#     )     


email    5
first    5
last     5
dtype: int64
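
The `axis` argument decides which series `apply` hands to the function. The following sketch uses a made-up two-row frame and shows the default (one call per column) next to `axis='columns'` (one call per row).

```python
import pandas as pd

# Illustrative two-row frame (the email values are made up)
df = pd.DataFrame(
    {
        "first": ["Corey", "Jane"],
        "last": ["Schafer", "Doe"],
        "email": ["corey@example.com", "jane@example.com"],
    }
)

# Default axis: the function receives each column, so len() is the row count
per_column = df.apply(len)

# axis='columns': the function receives each row, so len() is the column count
per_row = df.apply(len, axis="columns")

# A row-wise lambda can combine values from several columns
initials = df.apply(lambda row: row["first"][0] + row["last"][0], axis="columns")
print(per_column.tolist(), per_row.tolist(), initials.tolist())
```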

#### 6.3.2 `applymap` a function to a dataframe and get a dataframe as a result.  
Applied elementwise


In [85]:
# df.applymap(len)
df.applymap(str.lower)              # On pandas 2.1+ prefer df.map(str.lower); applymap is deprecated there

|   | email | first | last |
|---|---|---|---|
| 0 | coreymschafer@gmail.com | corey | schafer |
| 1 | janedoe@email.com | jane | doe |
| 2 | janeysmith@email.com | janey | smith |
| 3 | john2smith@email.com | johnny | smith |
| 4 | jimmydoe@email.com | jimmy | doe |


#### 6.3.3 `map` a series and get a series as a result.  
Replaces __all__ elements in series  

In [86]:
# .map only works on a series. Use like a vlookup
# Use it to substitute one value for another via a lookup dictionary.
# Unsubstituted values are replaced by NaN
df['first'].map({'Corey': 'Chris', 'Jane': 'Mary'})

0    Chris
1     Mary
2      NaN
3      NaN
4      NaN
Name: first, dtype: object

#### 6.3.4 `replace` a series and get series result

In [87]:
# .replace works like map but leaves unsubstituted values untouched (not NaN)
df['first'] = df['first'].replace({'Corey': 'Corey2', 'Jane': 'Jane2'})
df

|   | email | first | last |
|---|---|---|---|
| 0 | coreymschafer@gmail.com | Corey2 | Schafer |
| 1 | janedoe@email.com | Jane2 | Doe |
| 2 | janeysmith@email.com | Janey | Smith |
| 3 | John2Smith@email.com | Johnny | Smith |
| 4 | jimmydoe@email.com | Jimmy | Doe |
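
To see `map` and `replace` side by side, the sketch below applies the same lookup dictionary with both. The `fillna` step at the end is an extra idea, not from the videos: it shows one way to keep `map` while retaining unmatched values.

```python
import pandas as pd

first = pd.Series(["Corey", "Jane", "Janey"], name="first")
lookup = {"Corey": "Chris", "Jane": "Mary"}

# map: values missing from the lookup become NaN
mapped = first.map(lookup)

# replace: values missing from the lookup are left untouched
replaced = first.replace(lookup)

# map followed by fillna with the original series gives the same result as replace
mapped_keep = first.map(lookup).fillna(first)
print(mapped.tolist(), replaced.tolist(), mapped_keep.tolist())
```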


## 7. Updating Shape

### 7.1 Columns

#### 7.1.1 Adding Columns

In [88]:
# Use bracket notation to create a column: dot notation (df.full_name = ...) would set an attribute, not a column

# Create multiple columns at once 
# df[['first', 'last']] = df['full_name'].str.split(' ', expand=True)

# Create a new column from strings; numeric columns work too, e.g. with .apply 
df['full_name'] = df['first'] + ' ' + df['last']

# Split data with str.split.  It splits on whitespace by default, so the ' ' is optional.
# It returns a list per row by default; expand=True returns a dataframe with one column per part
df['full_name'].str.split(' ', expand=True)

df

|   | email | first | last | full_name |
|---|---|---|---|---|
| 0 | coreymschafer@gmail.com | Corey2 | Schafer | Corey2 Schafer |
| 1 | janedoe@email.com | Jane2 | Doe | Jane2 Doe |
| 2 | janeysmith@email.com | Janey | Smith | Janey Smith |
| 3 | John2Smith@email.com | Johnny | Smith | Johnny Smith |
| 4 | jimmydoe@email.com | Jimmy | Doe | Jimmy Doe |
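
The "create multiple columns at once" idiom can be tried in isolation. This is a minimal sketch on a made-up two-row frame, assuming every name has exactly one space.

```python
import pandas as pd

df = pd.DataFrame({"full_name": ["Corey Schafer", "Jane Doe"]})

# expand=True returns a DataFrame with one column per part (labelled 0 and 1)
parts = df["full_name"].str.split(" ", expand=True)

# Assign both new columns in one step by listing their names
df[["first", "last"]] = parts
print(df)
```

Names with more or fewer parts would produce a different number of columns, so the two-name assignment only fits data shaped like this.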


#### 7.1.2 Dropping Columns

In [89]:
# Remove columns with .drop like a db
df.drop(columns=['first', 'last'], inplace=True)
df

|   | email | full_name |
|---|---|---|
| 0 | coreymschafer@gmail.com | Corey2 Schafer |
| 1 | janedoe@email.com | Jane2 Doe |
| 2 | janeysmith@email.com | Janey Smith |
| 3 | John2Smith@email.com | Johnny Smith |
| 4 | jimmydoe@email.com | Jimmy Doe |


### 7.2 Rows

#### 7.2.1 Adding Rows

In [None]:
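# Adding a single row with .append (deprecated in pandas 1.4 and removed in 2.0)
# df.append({'first': 'Tony'}, ignore_index=True) # insert new row even if no index given: ignore_index=True

# Use pd.concat with a one-row dataframe instead:
df2 = pd.DataFrame({'full_name': ['Tony Stark']})   # df only has 'email' and 'full_name' at this point
pd.concat([df, df2], ignore_index=True)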
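
A standalone sketch of adding a row with `pd.concat`, using a made-up frame. Columns missing from the new row are filled with NaN, and `ignore_index=True` renumbers the rows.

```python
import pandas as pd

df = pd.DataFrame({"first": ["Corey", "Jane"], "last": ["Schafer", "Doe"]})

# A one-row frame; 'last' is deliberately left out
new_row = pd.DataFrame({"first": ["Tony"]})

# ignore_index=True gives the result a fresh 0..n-1 index
combined = pd.concat([df, new_row], ignore_index=True)
print(combined)
```

`pd.concat` returns a new dataframe, so assign the result if the extra row should be kept.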

#### 7.2.2 Dropping Rows


In [90]:
df.drop(index=3, inplace=True)              # Deleting a row by its index label with .drop

filt = df['full_name'] == 'Jane2 Doe'       # Dropping rows based on values: filter first, then pass the matching index labels
df.drop(index=df[filt].index, inplace=True)


In [None]:
# Deleting rows based on values 


### 7.3 Dataframes

In [91]:
df1 = pd.concat([df, df2], ignore_index=True, sort=False) # Adding a whole new dataframe as new rows
df

|   | email | full_name |
|---|---|---|
| 0 | coreymschafer@gmail.com | Corey2 Schafer |
| 2 | janeysmith@email.com | Janey Smith |
| 4 | jimmydoe@email.com | Jimmy Doe |


## 8. Sorting

### 8.1 Sort a Series 

In [96]:
df['email'].sort_values()    # Sort a series (column) with .sort_values 

0    coreymschafer@gmail.com
2       janeysmith@email.com
4         jimmydoe@email.com
Name: email, dtype: object

### 8.2 Sort a Dataframe

In [101]:
# df.sort_values(by='email', ascending=False)   # Sort a dataframe by a single column with sort_values

df.sort_values(                                 # Sort a dataframe by multiple columns in a list with .sort_values
    by=['email', 'full_name'], 
    ascending=False)  

# df.sort_values(                               # Sort a dataframe by multiple columns in a list with .sort_values 
#     by=['email', 'full_name'],                # with a different ascending flag per column (also a list); make it permanent with inplace 
#     ascending=[False, True], 
#     inplace=True  
#     )

df.sort_index()                               # Restore the "original" order by sorting on the index with .sort_index

|   | email | full_name |
|---|---|---|
| 0 | coreymschafer@gmail.com | Corey2 Schafer |
| 2 | janeysmith@email.com | Janey Smith |
| 4 | jimmydoe@email.com | Jimmy Doe |
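
With only three rows left in `df`, the effect of per-column sort directions is hard to see. The sketch below uses a made-up frame with a repeated surname, so the second sort key actually matters.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "last": ["Doe", "Schafer", "Doe"],
        "first": ["John", "Corey", "Jane"],
    }
)

# Sort by surname descending, then first name ascending within each surname
sorted_df = df.sort_values(by=["last", "first"], ascending=[False, True])

# The original index labels travel with the rows, so sort_index restores the order
restored = sorted_df.sort_index()
print(sorted_df)
```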


## 9. Aggregates

In [None]:
# Lesson 8 Aggregates
# Corey uses a large data set; I'm just adding extra numeric columns to the existing one.
# 'first' and 'last' were dropped in 7.1.2, so rebuild them from 'full_name' for the examples below
df[['first', 'last']] = df['full_name'].str.split(' ', expand=True)
df['numeric_data_01'] = np.random.randint(0,100, size=len(df))
df['numeric_data_02'] = np.random.randint(0,100, size=len(df))

# Use aggregation functions, such as mean, mode, standard deviation etc., on one or more columns
df[['numeric_data_01', 'numeric_data_02']].median()

# Count the number of populated (non-NaN) fields in a column with .count
df['last'].count()

# Count the occurrences of each value with .value_counts 
df['last'].value_counts()

# Or to get a percentage use the normalize=True argument
df['last'].value_counts(normalize=True)*100
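
Because the cell above uses random numbers, its results change on every run. The following sketch repeats the same calls on fixed, made-up values so the results can be checked by hand.

```python
import pandas as pd

# Fixed illustrative data; None marks a missing surname
df = pd.DataFrame(
    {
        "last": ["Doe", "Doe", "Doe", "Schafer", None],
        "score": [10, 20, 60, 40, 50],
    }
)

median_score = df["score"].median()                    # middle value of 10, 20, 40, 50, 60
populated = df["last"].count()                         # NaN / None entries are not counted
counts = df["last"].value_counts()                     # occurrences of each surname
percent = df["last"].value_counts(normalize=True) * 100
print(median_score, populated, counts.to_dict(), percent.to_dict())
```

`value_counts` ignores missing values by default, so the percentages are out of the four populated rows.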


## 10. Groups


In [None]:
# Create a group in a similar way as we created a filter, but with .groupby(column_name)
# This gives you a group object, indexed by the group rather than the True / False series of a filter
# Pass a plain column name: with a one-item list, newer pandas versions expect get_group(('Smith',))
grp_last = df.groupby('last')
grp_last.groups                # KT added to see groups and indexes

# Then apply methods to the group in a 2nd step, e.g., .get_group 
grp_last.get_group('Smith')

# Apply a function (.value_counts) to a column after already being grouped
# Can filter further with .loc, which makes it like using a filter
# Can also get percentage like above with (normalize=True)*100
grp_last['first'].value_counts() #.loc['Smith']

# Can retrieve multiple columns and perform other aggregate functions with their methods 
grp_last[['numeric_data_01', 'numeric_data_02']].median() #.loc[['Smith' , 'Doe']]

# *** Or use more generic form to apply multiple aggregate functions with .agg ***
# Seems most generic to me!!!
grp_last[['numeric_data_01', 'numeric_data_02']].agg(['count', 'mean', 'std']) #.loc[['Smith' , 'Doe']]

# Counting rows with filter.  Counts the Trues in the returned series with .sum
filt = df['last'] == 'Doe'
df.loc[filt, 'first'].str.contains('Jane').sum()

# But for a group you need to .apply the function to each group's series 
grp_last['first'].apply(lambda x: x.str.contains('n').sum())
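
The group-then-aggregate pattern above can be checked on fixed numbers. In this sketch the `score` column is a hypothetical stand-in for the random `numeric_data_*` columns.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "last": ["Doe", "Doe", "Schafer"],
        "score": [10, 30, 50],   # hypothetical numeric column
    }
)

# Step 1: build the group object
grp = df.groupby("last")

# Step 2: pull out one group, or aggregate every group at once
doe_rows = grp.get_group("Doe")
stats = grp["score"].agg(["count", "mean"])
print(stats)
```

The result of `.agg` is indexed by the group key, so `stats.loc['Doe']` reads like applying a filter.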

In [None]:
# How to find the percentage of people with an 'n' in their first name, grouped by surname

# Create a series of the number of people with each surname
# (.rename keeps the series name as 'last'; pandas 2.0+ names value_counts results 'count')
surname_count = df['last'].value_counts().rename('last')
surname_count

# Create a series of people with each surname, with 'n' in first name
surname_count_with_n = grp_last['first'].apply(lambda x: x.str.contains('n').sum())
surname_count_with_n

# Merge the 2 series together, calculate the percentage (answer column) and tidy up column names
df_with_n = pd.concat([surname_count, surname_count_with_n], axis='columns', sort=False)
df_with_n['percentage'] = df_with_n['first']/df_with_n['last']*100
df_with_n.rename(columns={'first': 'First_with_an_n', 'last': 'Surname'}, inplace=True)
df_with_n.sort_values('percentage', ascending=False)
# df_with_n.loc['Smith']
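
As an alternative to merging two series, the same percentage can be sketched in one step: the mean of a True / False series is the fraction of Trues. This is a different route from the one in the cell above, shown on made-up names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "first": ["Corey", "Jane", "Bob", "Tony"],
        "last": ["Schafer", "Doe", "Doe", "Stark"],
    }
)

# True where the first name contains a lowercase 'n'
has_n = df["first"].str.contains("n")

# Group the boolean series by surname; mean() * 100 is the percentage of Trues
percent_with_n = has_n.groupby(df["last"]).mean() * 100
print(percent_with_n)
```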

## 11. Cleaning 

In [None]:
# Set-up some dirty data  
people = {
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'], 
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'], 
    'email': ['CoreyMSchafer@gmail.com', 'JaneDoe@email.com', 'JohnDoe@email.com', None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing']
}
df = pd.DataFrame(people)
df
# GOOD IDEA look for unique values in columns to see if you're likely to get problems 
# for i in df.columns:
#     print(f'\n{i}')
#     print(df[i].unique())

# Identify na values (by getting a mask) rather than drop them with .isna
df.isna()
# or
df.isna().sum()

# Cleaning. Replace unusual null markers ('NA', 'Missing') across the whole dataframe
# Could do all this at import time for csv pd.read_csv(XXXXX..., na_values=['NA','Missing'])
df.replace('NA', np.nan, inplace=True)
df.replace('Missing', np.nan, inplace=True)
df

# Cleaning. Replace NaN values with an actual value.  Most useful for NUMERIC data
df.fillna(0)

# Drop any / all rows that aren't totally complete with .dropna & how = 'any'
# default values are: df.dropna(axis='index', how='any')
df.dropna()

# Drop incomplete columns.  Which is all of them due to row 4
df.dropna(axis='columns')

# Drop rows that have missing data in any of the specified columns with how='any' & subset=[]
df.dropna(axis='index', how='any', subset=['last', 'email'])

# Check that each data type is correct.  If numeric columns are stored as strings, many aggregate functions won't work 
df.dtypes

# Cleaning. Casting a column to the correct data type with .astype
# Can use .astype on whole dataframe too.
# Use float not int, as NaN is a float.
df['age'] = df['age'].astype(float)
df.dtypes
# df['age'].mean()
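
`.astype(float)` only works once the text markers have been replaced with NaN. As an alternative sketch (not the method used above), `pd.to_numeric` with `errors='coerce'` converts and cleans in one step by turning anything unparseable into NaN.

```python
import pandas as pd

# Ages stored as strings, with a missing value and a text marker
df = pd.DataFrame({"age": ["33", "55", None, "Missing"]})

# errors='coerce' turns values that cannot be parsed into NaN instead of raising
df["age"] = pd.to_numeric(df["age"], errors="coerce")

mean_age = df["age"].mean()   # NaN values are skipped by default
print(df["age"].dtype, mean_age)
```

The trade-off is that coercion hides typos as well as genuine null markers, so it is worth checking `df['age'].isna().sum()` afterwards.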

## 12. Datetime 

In [None]:
# Lesson 10 - Date Time Series
# Can use format= if pd.to_datetime doesn't auto-recognise the date / time format
df = pd.read_csv('time_series.csv')
df['Date']=pd.to_datetime(df['Date'], format='%Y-%m-%d %I-%p')
df.dtypes

In [None]:
# To find day name for single cell with .day_name() method
df.loc[0, 'Date'].day_name()

In [None]:
# New column containing day name with .dt.day_name()
df['DayOfWeek'] = df['Date'].dt.day_name()
df

In [None]:
# Some date functions
print(df['Date'].min())
print(df['Date'].max())
print(df['Date'].max() - df['Date'].min()) # Known as time delta

In [None]:
# Filtering on a date range: convert the date strings to datetimes with pd.to_datetime
filt = (df['Date'] >= pd.to_datetime('2019-01-01')) & (df['Date'] < pd.to_datetime('2020-01-01'))
df.loc[filt]

In [None]:
# Setting date column as an index for later functions
df.set_index('Date', inplace=True)

In [None]:
# Single value slice on index with .loc
df.loc['2019']

In [None]:
# Slice a date range on the index with .loc and ':' (both ends are included)
df.loc['2020-01':'2020-02']

In [None]:
# Get an aggregate value (eg mean or max) of a column sliced by date 
print(  df.loc['2020-01':'2020-02']['Close'].mean() )
print(  df.loc['2020-01-01']['High'].max()  )

In [None]:
# Resample (downsample) a range using 'D' for day and .resample
highs = df['High'].resample('D').max()
highs

In [None]:
# Quick line plot with matplotlib & a magic command needed for Jupyter notebook
%matplotlib inline 
highs.plot()

In [None]:
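# Resample whole dataframe with single aggregation method
# numeric_only=True skips text columns such as 'DayOfWeek', which pandas 2.0+ would otherwise reject
df.resample('W').mean(numeric_only=True)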

In [None]:
# Resample whole dataframe with different aggregations with a map & .agg method
df.resample('W').agg({'Close': 'mean', 'High': 'max', 'Low': 'min', 'Volume': 'sum'})
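
The cells above depend on `time_series.csv`. The sketch below rebuilds the same index slicing and resampling steps on four made-up timestamps, so it runs without the file; the `High` and `Volume` values are invented.

```python
import pandas as pd

# Two readings per day over two days (illustrative values)
dates = pd.to_datetime(
    ["2020-01-01 09:00", "2020-01-01 15:00", "2020-01-02 09:00", "2020-01-02 15:00"]
)
df = pd.DataFrame(
    {"High": [10.0, 12.0, 11.0, 9.0], "Volume": [100, 200, 300, 400]},
    index=dates,
)

# Partial-string indexing: every row that falls on the given day
first_day = df.loc["2020-01-01"]

# Downsample to one row per day, with one aggregation for a single column...
daily_high = df["High"].resample("D").max()

# ...or a different aggregation per column via a dict passed to .agg
daily = df.resample("D").agg({"High": "max", "Volume": "sum"})
print(daily)
```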

## 13. Data Sources

In [None]:
# Lesson 11: Reading and Writing to Sources

In [None]:
# Working with csv's 
df = pd.read_csv('time_series.csv', index_col='Date') # Load in the csv

filt = (df['Volume'] > 1_000_000)                           # Do some processing
df_big_trade_days =  df.loc[filt]                           # Do some processing
# 
df_big_trade_days.to_csv('output.csv')                      # Save as csv 
df_big_trade_days.to_csv('output.tsv', sep='\t')            # Save as tsv (tab separators) 
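
To try the load, filter, save round trip without touching the disk, `to_csv` and `read_csv` also accept file-like objects. This is a minimal sketch with an in-memory buffer and made-up volumes; the column names mirror the cell above.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"Date": ["2020-01-01", "2020-01-02"], "Volume": [500_000, 2_000_000]}
).set_index("Date")

# Do some processing: keep only the big trade days
big = df.loc[df["Volume"] > 1_000_000]

# Write tab-separated text to an in-memory buffer instead of a file
buffer = io.StringIO()
big.to_csv(buffer, sep="\t")

# Rewind and read it back with the same separator and index column
buffer.seek(0)
round_trip = pd.read_csv(buffer, sep="\t", index_col="Date")
print(round_trip)
```

The same `sep` has to be given on both sides, otherwise the file is read back as a single column.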

In [None]:
# Working with Excel with .to_excel and read_excel
# conda install xlwt openpyxl xlrd 
df_big_trade_days.to_excel('output.xlsx')                   # Saving a dataframe to Excel.  Can use sheet arg & row & column 
df_excel = pd.read_excel('output.xlsx' , index_col='Date')  # Loading in from Excel
df_excel

In [None]:
# Working with json with .to_json and read_json
df_big_trade_days.to_json('output.json', orient='records', lines=True)
# Make records / list-like rather than dictionary-like with: orient='records' 
# Make each record a new line with lines=True
df_json = pd.read_json('output.json', orient='records', lines=True)
df_json
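
What `orient='records', lines=True` produces is easiest to see on a tiny frame. In this sketch `to_json` is called without a path, so it returns the text instead of writing a file; the `symbol` and `volume` columns are made up.

```python
import io

import pandas as pd

df = pd.DataFrame({"symbol": ["ETH", "ETH"], "volume": [100, 200]})

# With no path given, to_json returns the JSON text
text = df.to_json(orient="records", lines=True)

# One JSON object per line, one line per row
lines = text.strip().splitlines()

# Read it back from an in-memory buffer with the same options
restored = pd.read_json(io.StringIO(text), orient="records", lines=True)
print(lines[0])
```

Note that `orient='records'` does not write the index, so call `reset_index()` first if the index holds data worth keeping.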

In [None]:
# Working with SQL
# Set up database 
# from sqlalchemy import create_engine
# import psycopg2
# engine = create_engine('postgresql://dbuser:dbpass@localhost:5432/sample_db')
# df.to_sql('sample_table', engine, if_exists='replace')
# sql_df = pd.read_sql('sample_table', engine, index_col='Respondent')
# sql_df = pd.read_sql_query('SELECT * FROM sample_table', engine, index_col='Respondent')
# # sql_df.head()


In [None]:
# Can read directly from a URL
posts_df = pd.read_json('https://raw.githubusercontent.com/CoreyMSchafer/code_snippets/master/Python/Flask_Blog/snippets/posts.json')
posts_df.head()
