In [None]:
# Import pandas so the cells below can use it
import pandas as pd

In [None]:
df = pd.DataFrame()

pandas objects (both DataFrames and Series!) come with methods and attributes that make retrieving information from the data particularly easy. Some commonly used methods:

.head() -------- first n rows (5 by default)
.tail() -------- last n rows (5 by default)
.info() -------- concise summary of the DataFrame
.sum() --------- sum of the values (one total per column for a DataFrame)
.value_counts()- how many times each distinct value occurs
.describe() ---- descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution

And attributes:

.index --------- the index (row labels) of the DataFrame
.columns ------- the column labels of the DataFrame
.dtypes -------- the data type of each column (compare with .info()!); this is an attribute, so it takes no parentheses
.shape --------- a tuple giving the dimensionality of the DataFrame as (rows, columns)
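As a quick illustration of the methods and attributes above, here is a minimal sketch on a small made-up DataFrame. The column names (Sex, Pclass, Fare) and all values are invented for the example.

```python
import pandas as pd

# Small made-up dataset; columns and values are illustrative only
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Pclass': [3, 1, 3, 1, 2, 3],
    'Fare': [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
})

first_two = df.head(2)                  # first 2 rows (head() alone gives 5)
last_two = df.tail(2)                   # last 2 rows
sex_counts = df['Sex'].value_counts()   # rows per distinct value
total_fare = df['Fare'].sum()           # sum of a single column

print(df.shape)        # (rows, columns)
print(df.columns)      # column labels
print(df.dtypes)       # data type of each column (attribute, no parentheses)
df.info()              # prints a concise summary
print(df.describe())   # summary statistics for the numeric columns
```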


*Aggregation functions*

There are many built-in aggregate methods provided for you in the pandas package, and you can even write and apply your own. Some of the most common aggregate methods you may want to use are:

.min(): returns the minimum value for each column by group
.max(): returns the maximum value for each column by group
.mean(): returns the average value for each column by group
.median(): returns the median value for each column by group
.count(): returns the count of each column by group

.groupby() ----- splits a DataFrame into groups based on the values in one or more columns, so that each group can be aggregated

A groupby object on its own has no meaningful way to display information. Chaining an aggregation function onto the groupby is what lets you compute summary statistics for each group!

You can quickly use an aggregation function by chaining the call to the end of the .groupby() method, such as:
    df.groupby('Sex').sum()

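You can also split data into multiple different levels of groups by passing in a list containing the name of every column you want to group by -- for instance, by every combination of both Sex and Pclass.

df.groupby(['Sex', 'Pclass']).mean()

Note that in pandas 2.0 and later, .mean() raises an error if any of the remaining columns are non-numeric. In that case, pass numeric_only=True or select the numeric columns first.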
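The grouping patterns above, along with a custom aggregation, can be sketched as follows. The data is made up for the example, and fare_range is a hypothetical helper rather than a pandas function.

```python
import pandas as pd

# Made-up data; values chosen so the group means are easy to check by hand
df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Pclass': [3, 1, 3, 1, 3, 3],
    'Fare': [8.0, 70.0, 8.0, 52.0, 12.0, 6.0],
})

# One grouping column: mean Fare per Sex
mean_fare_by_sex = df.groupby('Sex')['Fare'].mean()

# Two grouping columns: the result is indexed by (Sex, Pclass) pairs
mean_fare_by_sex_class = df.groupby(['Sex', 'Pclass'])['Fare'].mean()

# Custom aggregation: any function that reduces a group to a single value
def fare_range(values):
    return values.max() - values.min()

range_by_sex = df.groupby('Sex')['Fare'].agg(fare_range)

print(mean_fare_by_sex)
print(mean_fare_by_sex_class)
print(range_by_sex)
```

Selecting the Fare column before aggregating keeps the example independent of how non-numeric columns are handled.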

*Concatenation*

To perform a concatenation between two or more DataFrames, you pass a list of the objects to concatenate to the pd.concat() function, as demonstrated below:

to_concat = [df1, df2, df3]
big_df = pd.concat(to_concat)
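To make the snippet above runnable, here is a minimal sketch with three tiny made-up DataFrames. The column names and values are assumptions for the example only.

```python
import pandas as pd

# Three small made-up tables with the same columns
df1 = pd.DataFrame({'name': ['a', 'b'], 'score': [1, 2]})
df2 = pd.DataFrame({'name': ['c'], 'score': [3]})
df3 = pd.DataFrame({'name': ['d', 'e'], 'score': [4, 5]})

to_concat = [df1, df2, df3]

# Rows are stacked in order; the original index labels are kept
big_df = pd.concat(to_concat)

# ignore_index=True builds a fresh 0..n-1 index instead
big_df_reindexed = pd.concat(to_concat, ignore_index=True)

print(big_df)
print(big_df_reindexed)
```

By default the index labels of the inputs are preserved, so the combined index can contain duplicates. Passing ignore_index=True avoids that when the old labels carry no meaning.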

*Keys and Indexes*
Tables in a relational database typically have a column that serves as the Primary Key. In pandas, the index plays the role of the primary key for a table, although pandas does not require index values to be unique. You'll use these keys, along with the Foreign Key, which points to a primary key value in another table, to execute Joins. This allows us to "line up" information from multiple tables and combine them into one table.

Often, it is useful for us to set a column to act as the index for a DataFrame. To do this, you would type:

some_dataframe.set_index('name_of_index_column', inplace=True)

Note that this will mutate the dataset in place and set the column with the specified name as the index column of the DataFrame. If inplace is not specified it will default to False, meaning that a copy of the DataFrame with the requested changes will be returned, but the original object will remain unchanged.

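NOTE: Running a cell that makes an inplace change more than once will often cause pandas to throw an error. For example, a second set_index() call raises a KeyError because the column has already been moved into the index. If this happens, restart the kernel and re-run the cells from the top so the data is reloaded.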

By setting the index columns on DataFrames, you make it easy to join DataFrames later on. Note that this is not always feasible, but it's a useful step when possible.
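The difference between the default behavior and inplace=True can be sketched as follows. The DataFrame and its column names (passenger_id, name) are made up for the example.

```python
import pandas as pd

# Made-up table; 'passenger_id' is a hypothetical key column
some_dataframe = pd.DataFrame({
    'passenger_id': [101, 102, 103],
    'name': ['Ann', 'Bob', 'Cy'],
})

# Without inplace=True a new DataFrame is returned; the original is unchanged
indexed_copy = some_dataframe.set_index('passenger_id')
still_a_column = 'passenger_id' in some_dataframe.columns

# With inplace=True the original object itself is modified
some_dataframe.set_index('passenger_id', inplace=True)

# 'passenger_id' is now the index rather than a column, so repeating the
# same call fails with a KeyError
try:
    some_dataframe.set_index('passenger_id', inplace=True)
    raised = False
except KeyError:
    raised = True

print(some_dataframe)
print('second call raised KeyError:', raised)
```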

*Types of Joins*

Joins are always executed between a Left Table and a Right Table. There are four different types of joins you can execute, and it is easy to conceptualize them as Venn diagrams:

An Outer Join returns all records from both tables
An Inner Join returns only the records with matching keys in both tables
A Left Join returns all the records from the left table, as well as any records from the right table that have a matching key with a record from the left table
A Right Join returns all the records from the right table, as well as any records from the left table that have a matching key with a record from the right table
![Image_198_joins.png](attachment:Image_198_joins.png)

DataFrames contain a built-in .join() method. By default, the table calling the .join() method is always the left table. The following code snippet demonstrates how to execute a join in pandas:

joined_df = df1.join(df2, how='inner')
Note that to call .join(), you must pass in the right table. By default, .join() matches rows on the index of both tables. You can also set the type of join to perform with the how parameter. The options are 'left', 'right', 'inner', and 'outer'.

If how= is not specified, it defaults to 'left'.

NOTE: If both tables contain columns with the same name, the join will throw an error due to a naming collision, since the resulting table would have multiple columns with the same name. To solve this, pass in a value to lsuffix= or rsuffix=, which will append this suffix to the offending columns to resolve the naming collisions.
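The join types and the suffix parameters can be sketched as follows. Both tables are made up for the example: they share a 'score' column to trigger the naming collision, and their indexes overlap only partly.

```python
import pandas as pd

# Made-up left and right tables, keyed by their index
df1 = pd.DataFrame(
    {'name': ['Ann', 'Bob', 'Cy'], 'score': [1, 2, 3]},
    index=[101, 102, 103],
)
df2 = pd.DataFrame(
    {'city': ['Oslo', 'Rome', 'Lima'], 'score': [7, 8, 9]},
    index=[102, 103, 104],
)

# Both tables have a 'score' column, so suffixes are needed to avoid an error
inner = df1.join(df2, how='inner', lsuffix='_left', rsuffix='_right')
left = df1.join(df2, lsuffix='_left', rsuffix='_right')  # how defaults to 'left'
outer = df1.join(df2, how='outer', lsuffix='_left', rsuffix='_right')

print(inner)   # only keys present in both tables
print(left)    # every key from df1; missing right-hand values become NaN
print(outer)   # every key from either table
```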

*Missing data*

NaNs
By default, pandas represents null values with NaN, which is short for Not a Number. Pandas provides many great ways for checking for null values, built right into DataFrames and Series objects.

Detecting NaNs
df.isna()
Returns a DataFrame of boolean values with the same shape as the original, where all cells containing NaN are converted to True, and all cells containing valid data are converted to False.

df.isna().sum()
Since True is equivalent to 1 and False is equivalent to 0 in Python, taking the .sum() of the result counts the NaN values. For a DataFrame this returns one count per column; for a Series it returns a single total.
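A minimal sketch of both calls, using a made-up DataFrame with a few missing values (the column names are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in two of the three columns
df = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, np.nan],
    'cabin': ['C85', None, None, None],
    'fare': [7.25, 71.28, 8.05, 53.10],
})

mask = df.isna()                   # same shape as df, True where a value is missing
missing_per_column = mask.sum()    # Series with one count per column
total_missing = mask.sum().sum()   # single number for the whole DataFrame

print(missing_per_column)
print(total_missing)
```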

.applymap() takes a function as input and applies it to every entry in the DataFrame. In pandas 2.1 and later it is deprecated in favor of DataFrame.map(), which does the same thing.

.map() or .apply() takes a function as input and applies it to every entry in the Series.
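As a minimal sketch of element-wise functions, the example below applies a hypothetical helper named double to a made-up DataFrame and its columns. The hasattr check is an assumption about which pandas version is installed, so that the snippet runs on both older and newer releases.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

def double(x):
    return x * 2

# Element-wise on a DataFrame: use DataFrame.map where it exists
# (pandas 2.1+), otherwise fall back to applymap
if hasattr(pd.DataFrame, 'map'):
    elementwise = df.map(double)
else:
    elementwise = df.applymap(double)

# Element-wise on a Series
doubled_a = df['a'].map(double)
squared_b = df['b'].apply(lambda x: x ** 2)

print(elementwise)
print(doubled_a.tolist(), squared_b.tolist())
```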