# Review for Section of Course

In [1]:
%%html
<style>
table {float:left}
td {text-align:left}
</style>

## 01 Understanding Your Data

|Python Code|Notes/Explanation|
| :---      | :---            |
|__df.dtypes__|View the data types of columns|
|__df['year'].astype(float)__|Convert column to a new type|
|__df['year'] = df['year'].astype(float)__|Assign converted type back to dataframe|
|__pd.to_numeric(df.height, errors="coerce")__|Coerce conversion errors: values that fail to convert are set to NaN|
|__df['year'].min()__|Call aggregation function on series|
|__df.agg(['min','max','mean','std'])__|Use .agg to call multiple functions on dataframe|
|__df['height'].transform(lambda x: x / 10)__|Transform a column of data using custom function|
|__df.groupby('artist').transform('nunique')__|Apply a built-in function per group; the result keeps the original row count|
|__df.filter(items=['id','artist'])__|View only certain columns|
|__df.filter(regex="(?i)year")__|View columns that match a regex, defaults to case sensitive ((?i) indicates case insensitive)|
|__df.filter(axis=0, like='100')__|Switch the axis to filter rows based on index labels (not on column content); filter has no case option, so use regex="(?i)..." for case-insensitive matching|

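The snippets above can be tried together in a minimal sketch. The sample data, the column names and the string index labels are made up for illustration and are not from the course dataset.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
df = pd.DataFrame({
    "id": ["a100", "a101", "b200"],
    "artist": ["A", "B", "A"],
    "Year": ["1900", "1850", "1900"],
    "height": ["120", "unknown", "80"],
})

# errors="coerce" turns values that cannot be parsed into NaN.
df["height"] = pd.to_numeric(df["height"], errors="coerce")
df["Year"] = df["Year"].astype(float)

# Aggregate several statistics at once on the numeric columns.
summary = df[["Year", "height"]].agg(["min", "max", "mean"])

# transform returns a result with the same length as its input.
height_scaled = df["height"].transform(lambda x: x / 10)

# Match column names case-insensitively with the inline (?i) flag.
year_cols = df.filter(regex="(?i)year")

# axis=0 filters on index labels, not on the values in a column.
rows_a10 = df.set_index("id").filter(like="a10", axis=0)
```

Here `summary` has one row per statistic, and `rows_a10` keeps only the rows whose index label contains "a10".
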
## 02 Drop and Rename columns in DataFrame

|Python Code|Notes/Explanation|
| :---      | :---            |
|__df.drop('id',axis=1)__|Drop a single column|
|__df.drop(columns=['height','width'])__|Drop multiple columns|
|__df.drop('id',axis=1,inplace=True)__|Drop a column inplace (default is inplace=False, which returns new dataframe)|
|__df=pd.read_csv('file.csv', usecols=['artist','title'])__|Only import certain columns (by name)|
|__df.columns.str.lower()__|Generate a list of new column names using str function lower|
|__[x.upper() for x in df.columns]__|Generate a list of new column names using a list comprehension|
|__df.columns = list(map(lambda x: x.lower(), df.columns))__|Use map and assign the result back to df.columns to permanently change the column names (map alone changes nothing)|
|__df.rename(columns={'start':'finish'})__|Rename a column, returns new dataframe (original df is unchanged)|
|__df.rename(columns=lambda x: x.upper(), inplace=True)__|Use rename with lambda and inplace to alter existing df|
|__df=pd.read_csv('file.csv',names=['col_one','col_two'],header=0)__|Rename columns as df is read in, replacing the existing header row|

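A short sketch of which calls return a new dataframe and which change the existing one. The CSV text and column names are hypothetical, and `io.StringIO` stands in for a real file so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical CSV content used in place of a real file.
csv_text = "ID,Artist,Title,Height\n1,A,First,10\n2,B,Second,20\n"

# usecols limits the import to the named columns.
df = pd.read_csv(io.StringIO(csv_text), usecols=["Artist", "Title"])

# rename returns a new dataframe; df itself is unchanged here.
lowered = df.rename(columns=lambda x: x.lower())

# Assigning to df.columns changes the names on df itself.
df.columns = [x.upper() for x in df.columns]

# names= together with header=0 replaces the existing header row.
renamed = pd.read_csv(
    io.StringIO(csv_text),
    names=["col_one", "col_two", "col_three", "col_four"],
    header=0,
)

# drop also returns a new dataframe unless inplace=True is passed.
trimmed = renamed.drop(columns=["col_three", "col_four"])
```

After this runs, `renamed` still has all four columns, because `drop` was called without `inplace=True`.
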
## 03 Indexing and Filtering Datasets

|Python Code|Notes/Explanation|
| :---      | :---            |
|__df['col_name']__|Access a column as a pandas series|
|__df['col_name'][1]__|Access a single value in a column (the row with index label 1)|
|__df[1:5]__|Access a range of rows with a slice (slice is inclusive/exclusive), returns rows in integer positions 1,2,3,4|
|__df[df['year'] > 1800]__|Use a basic filter|
|__df.loc[ROWS,COLUMNS]__|Basic format of dataframe .loc function|
|__df.loc[0:2,:]__|Access a slice of rows using index labels and all columns, loc slice is inclusive/inclusive|
|__df.loc[0:2,['col_name1','col_name2']]__|Can use lists for row labels or column names for specific rows/columns|
|__df.loc[df['artist'] == 'Artist Name',:]__|Filter rows using boolean series (combine criteria with & (and), \| (or) and ~ (not))|
|__df.iloc[ROWS,COLS]__|Like .loc, except uses integer positioning instead of row labels and column names|
|__df.iloc[0:2,:]__|Slice uses integer position and is inclusive/exclusive, so returns rows at integer position 0,1 (not 2)|
|__df.iloc[[1,5],[12,100]]__|Like .loc, can use lists to define rows and columns|
|__df['col_name'].str.contains('search')__|Generate pandas series of boolean based on search|
|__df.loc[df['col_name'].str.contains('search')]__|Filter dataframe using str.contains|
|__df.loc[df['col_name'].str.contains('search1\|search2',case=False,regex=True)]__|Filtering using case insensitive and regular expression|
|__df.loc[df['col_name'].astype(str).str.contains('search',na=False)]__|Convert to string first so .str.contains works on non-string columns; na=False treats missing values as non-matches|

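The difference between label-based and position-based slicing is easiest to see side by side. This is a minimal sketch on made-up data with the default integer index; the artist names and years are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the default index 0..4.
df = pd.DataFrame({
    "artist": ["Blake", "Turner", np.nan, "blake", "Moore"],
    "year": [1790, 1840, 1900, 1805, 1950],
})

# .loc slices by label and includes both ends: labels 1, 2 and 3.
by_label = df.loc[1:3, :]

# .iloc slices by position and excludes the end: positions 1 and 2.
by_position = df.iloc[1:3, :]

# Combine criteria with & and ~; each condition needs its own parentheses.
mask = (df["year"] > 1800) & ~(df["artist"] == "Moore")
filtered = df.loc[mask, ["artist", "year"]]

# Case-insensitive regex search; na=False counts missing values as non-matches.
found = df.loc[
    df["artist"].str.contains("blake|moore", case=False, regex=True, na=False)
]
```

The same slice `1:3` returns three rows with `.loc` and two rows with `.iloc`.
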
## 04 Handling Bad, Missing and Duplicate Data

|Python Code|Notes/Explanation|
| :---      | :---            |
|__df['title'].str.strip()__|Strip whitespace from entire column|
|__df['title'].transform(lambda x: x.strip())__|Strip whitespace using lambda for greater flexibility|
|__df.replace({'col_name':{'value':nan}})__|Import nan from numpy and replace a specific value in a column with nan|
|__inplace=True__|In many situations, use inplace=True to change original data (modify source)|
|__df.loc[df['col_name']=='value', ['col_name']]=nan__|Replace specific values in a column with NaN|
|__df.fillna(-1)__|Fill all NaN values in entire dataframe with -1|
|__df.fillna(value={'col':0})__|Fill NaN values in a specific column|
|__df.dropna()__|Drop rows with ANY NaN values, equivalent to df.dropna(how='any')|
|__df.dropna(how='all')__|Drop rows with ALL NaN values|
|__df.dropna(thresh=15)__|Keep only rows with at least 15 non-NaN values (rows with fewer are dropped)|
|__df.dropna(subset=['col_1','col_2'],inplace=True)__|Drop rows based on NaN values in specific set of columns|
|__df.drop_duplicates()__|Drop duplicate rows, keeping the first occurrence (default keep='first')|
|__df.drop_duplicates(subset=['col_1','col_2'])__|Drop duplicates if they match against a subset of columns|
|__df.drop_duplicates(keep=False)__|Specify which rows to keep: 'first', 'last' or False (don't keep any duplicates)|
|__df.loc[df.duplicated(subset=['col1','col2'],keep=False)]__|Find and see duplicates using .loc across specific columns|
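
The dropna and drop_duplicates options above are easy to mix up, so here is a minimal sketch on made-up data. The titles, years and widths are placeholders chosen so that each option gives a different row count.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: rows 1 and 2 are duplicates, row 3 is all NaN.
df = pd.DataFrame({
    "title": [" Dawn ", "Dusk", "Dusk", None, "Noon", "Eve"],
    "year": [1900, 1910, 1910, np.nan, np.nan, np.nan],
    "width": [10, 20, 20, np.nan, 30, np.nan],
})

# Strip whitespace from the whole column; missing values stay missing.
df["title"] = df["title"].str.strip()

any_nan_dropped = df.dropna()           # drops rows with any NaN
all_nan_dropped = df.dropna(how="all")  # drops only the all-NaN row
at_least_two = df.dropna(thresh=2)      # keeps rows with >= 2 non-NaN values

# Fill NaN in one column only; other columns are left alone.
filled = df.fillna(value={"width": 0})

deduped = df.drop_duplicates()               # keeps the first of each duplicate
no_copies = df.drop_duplicates(keep=False)   # drops every duplicated row

# Show all rows that are duplicated on a subset of columns.
dupes = df.loc[df.duplicated(subset=["title", "year"], keep=False)]
```

With this data, `thresh=2` drops both the all-NaN row and the row that has only a title.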