<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Purpose" data-toc-modified-id="Purpose-1">Purpose</a></span><ul class="toc-item"><li><span><a href="#Reading-in-Data" data-toc-modified-id="Reading-in-Data-1.1">Reading in Data</a></span></li><li><span><a href="#Writing-out-Data" data-toc-modified-id="Writing-out-Data-1.2">Writing out Data</a></span></li><li><span><a href="#Creating/Converting-to-a-Dataframe" data-toc-modified-id="Creating/Converting-to-a-Dataframe-1.3">Creating/Converting to a Dataframe</a></span><ul class="toc-item"><li><span><a href="#Pandas-Series-to-Dataframe" data-toc-modified-id="Pandas-Series-to-Dataframe-1.3.1">Pandas Series to Dataframe</a></span></li></ul></li><li><span><a href="#Selection-of-Data" data-toc-modified-id="Selection-of-Data-1.4">Selection of Data</a></span></li><li><span><a href="#Actions-on-Data" data-toc-modified-id="Actions-on-Data-1.5">Actions on Data</a></span></li><li><span><a href="#Summarizing-a-Dataframe" data-toc-modified-id="Summarizing-a-Dataframe-1.6">Summarizing a Dataframe</a></span><ul class="toc-item"><li><span><a href="#Descriptive-Statistics-of-a-dataframe" data-toc-modified-id="Descriptive-Statistics-of-a-dataframe-1.6.1">Descriptive Statistics of a dataframe</a></span></li><li><span><a href="#Datatypes-of-each-column-in-a-dataframe" data-toc-modified-id="Datatypes-of-each-column-in-a-dataframe-1.6.2">Datatypes of each column in a dataframe</a></span></li></ul></li><li><span><a href="#Miscellaneous" data-toc-modified-id="Miscellaneous-1.7">Miscellaneous</a></span><ul class="toc-item"><li><span><a href="#Difference-between-copying-and-referencing-your-data" data-toc-modified-id="Difference-between-copying-and-referencing-your-data-1.7.1">Difference between copying and referencing your data</a></span></li></ul></li></ul></li></ul></div>

In [2]:
import pandas as pd
import json
import numpy as np

## Purpose
This post is intended to be an ever-growing list of functions that are handy for any **Data Scientist** to know while wrangling data with pandas. Use the table of contents to navigate to a particular function.

### Reading in Data
***

### Writing out Data
***

### Creating/Converting to a Dataframe
***

#### Pandas Series to Dataframe
You can use the *to_frame()* method to convert a pandas series to a pandas dataframe. The *name* parameter sets the name of the resulting column in the dataframe.

In [58]:
data = pd.read_csv('./data/titanic/titanic.csv')
data.dtypes.to_frame(name = 'data_type')

Unnamed: 0,data_type
PassengerId,int64
Survived,int64
Pclass,int64
Name,object
Sex,object
Age,float64
SibSp,int64
Parch,int64
Ticket,object
Fare,float64
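As a minimal, self-contained sketch of the same idea, the snippet below builds a small series by hand (the values and the `age_years` label are illustrative assumptions, not part of the Titanic file) and shows how the *name* parameter affects the column label.

```python
import pandas as pd

# Hypothetical series standing in for any pandas series
ages = pd.Series([22.0, 38.0, 26.0], name="Age")

# Without the name argument, the column takes the series' own name
default_frame = ages.to_frame()

# The name parameter overrides the column label
renamed_frame = ages.to_frame(name="age_years")

print(list(default_frame.columns))
print(list(renamed_frame.columns))
```

If the series has no name and none is passed, pandas falls back to a default column label, so passing *name* explicitly tends to give more readable results.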


### Selection of Data
***

### Actions on Data
***

### Summarizing a Dataframe
***

#### Descriptive Statistics of a dataframe
You can use the describe() function to get descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each variable in a data frame. By default, describe() summarizes only the numerical variables, but you can change that with the *include* or *exclude* parameter, which selects the data types that describe() works with.

Full documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html  

In [50]:
data = pd.read_csv('./data/titanic/titanic.csv')
data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [51]:
data.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891,891.0,204,889
unique,,,,891,2,,,,681,,147,3
top,,,,"Simonius-Blumer, Col. Oberst Alfons",male,,,,CA. 2343,,G6,S
freq,,,,1,577,,,,7,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,
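The *include* and *exclude* parameters also accept a list of data types. The sketch below uses a small hand-made data frame (an assumption for illustration, loosely modelled on the Titanic columns) to summarize the numerical and non-numerical columns separately.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with two numerical columns and one text column
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Sex": ["male", "female", "female", "female"],
})

# Keep only the numerical columns (this matches the default behaviour here)
numeric_summary = df.describe(include=[np.number])

# Drop the numerical columns, leaving count/unique/top/freq for the rest
non_numeric_summary = df.describe(exclude=[np.number])

print(numeric_summary)
print(non_numeric_summary)
```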


#### Datatypes of each column in a dataframe
The *dtypes* attribute lists the data type of each column in the dataframe.

In [57]:
data = pd.read_csv('./data/titanic/titanic.csv')
data.dtypes.to_frame(name = 'data_type')

Unnamed: 0,data_type
PassengerId,int64
Survived,int64
Pclass,int64
Name,object
Sex,object
Age,float64
SibSp,int64
Parch,int64
Ticket,object
Fare,float64
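Because *dtypes* is an attribute rather than a method, it is accessed without parentheses and returns a series indexed by column name. A small sketch, assuming a hand-made data frame rather than the Titanic file:

```python
import pandas as pd

# Hypothetical data frame with an integer, a float, and a boolean column
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, 38.0, 26.0],
    "Survived": [False, True, True],
})

# dtypes is an attribute: no parentheses
column_types = df.dtypes
print(column_types)

# A single column exposes its type through the dtype attribute
age_type = df["Age"].dtype
print(age_type)
```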


### Miscellaneous
***

#### Difference between copying and referencing your data
You will often need to make a copy of your data. It is important to distinguish between making an independent copy of your data and simply referencing the same data under a different variable name.

In [41]:
data = pd.read_csv('./data/titanic/titanic.csv')
data.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [42]:
# Assigning data to a new variable creates a reference to the original data set
reference_to_data = data

# Using the .copy() method creates a new, independent copy of the data set
copy_of_data = data.copy()

In [44]:
# Let's change the name of the first passenger through the 'reference_to_data' variable.
reference_to_data.iloc[0,3] = "XXXXXXXXXXXXXX"

In [45]:
# As we can see, the name has changed in the reference data frame
reference_to_data.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,XXXXXXXXXXXXXX,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [48]:
# The name has also changed in the original data frame, which is often an unintended consequence
data.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,XXXXXXXXXXXXXX,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [49]:
# However, the copy of the data frame still has the original name
copy_of_data.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
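A quick way to tell the two cases apart is Python's `is` operator, which checks whether two variables point to the same object. The sketch below repeats the experiment on a small hand-made data frame (an assumption for illustration) so that it runs without the Titanic file.

```python
import pandas as pd

# Hypothetical two-row data frame standing in for the Titanic data
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Age": [22.0, 26.0],
})

alias = df               # second name for the same object
independent = df.copy()  # separate object with its own data

print(alias is df)        # True
print(independent is df)  # False

# A change made through the original is visible through the alias only
df.loc[0, "Name"] = "XXXXXXXXXXXXXX"
print(alias.loc[0, "Name"])
print(independent.loc[0, "Name"])
```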
