# Python for Padawans

This tutorial will go through the basic data wrangling workflow I'm sure you all love to hate, in Python!
FYI: I come from an R background (aka I'm not a proper programmer), so if you see any formatting issues please cut me a bit of slack.

**The aim for this post is to show people how to easily move their R workflows to Python (especially pandas/scikit)**

One thing I especially like is how consistent all the functions are. You don't need to switch up style like you have to when you move from base R to dplyr etc.

And also, it's apparently much easier to push code to production using Python than R. So there's that.

Without further ado, let's load all our packages.

In [None]:
%matplotlib inline
import os
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import math

#### Don't forget the %matplotlib inline magic. Otherwise your graphs will pop up in separate windows and stop the execution of further cells. And nobody got time for that.

In [None]:
data = pd.read_csv('../input/loan.csv', low_memory=False)
data.drop(['id', 'member_id', 'emp_title'], axis=1, inplace=True)

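# Treat 'n/a' as missing, then reduce emp_length (e.g. '10+ years') to an integer
data.replace('n/a', np.nan, inplace=True)
# Assign the result back: calling inplace=True on a selected column may not update data
data['emp_length'] = data['emp_length'].fillna(0)
data['emp_length'] = data['emp_length'].replace(to_replace='[^0-9]+', value='', regex=True)
data['emp_length'] = data['emp_length'].astype(int)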
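
If you want to see what that cleaning step does without loading the whole loan file, here's a minimal sketch on a made-up Series. The sample values are my assumption about what emp_length looks like (things like '10+ years' and 'n/a'), so treat it as an illustration rather than the exact contents of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values in the same style as the emp_length column
raw = pd.Series(['10+ years', '< 1 year', '3 years', 'n/a', None])

cleaned = (
    raw.replace('n/a', np.nan)                   # treat the 'n/a' marker as missing
       .fillna('0')                              # missing employment length becomes 0
       .str.replace(r'[^0-9]+', '', regex=True)  # strip everything that is not a digit
       .astype(int)
)
print(cleaned.tolist())
```

Chaining the steps like this is roughly the pandas answer to a dplyr pipe: each method returns a new Series, so nothing is modified in place.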

Now let's make some pretty graphs. Coming from R I definitely prefer ggplot2 but the more I use Seaborn, the more I like it. If you kinda forget about adding "+" to your graphs and instead use the dot operator, it does essentially the same stuff.

**And I've just found out that you can create your own style sheets to make life easier. Wahoo!**

But anyway, below I'll show you how to format a decent looking Seaborn graph, as well as how to summarise a given dataframe.

In [None]:
import seaborn as sns
import matplotlib.ticker  # needed explicitly for FuncFormatter

# Count loans per employment length and tidy the result into two columns
s = data['emp_length'].value_counts().to_frame().reset_index()
s.columns = ['type', 'count']

def emp_dur_graph(graph_title):
    """Draw a bar chart of loan counts by employment length and return the Axes."""
    sns.set_style("whitegrid")
    ax = sns.barplot(y="count", x="type", data=s)
    ax.set(xlabel='', ylabel='', title=graph_title)
    # Show y-axis ticks with thousands separators, e.g. 100,000
    ax.get_yaxis().set_major_formatter(
        matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
    _ = ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
    return ax

ax = emp_dur_graph('Distribution of employment length for issued loans')
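
The summarising step is the bit that feeds the graph, so here's a small sketch of it on its own. The numbers are invented stand-ins for data['emp_length'], and rename_axis plus reset_index(name=...) is just one way to get the same two-column frame, not the only one.

```python
import pandas as pd

# Hypothetical employment lengths standing in for data['emp_length']
emp_length = pd.Series([10, 1, 10, 3, 10, 1])

summary = (
    emp_length.value_counts()        # counts per distinct value, largest first
              .rename_axis('type')   # name the index so it becomes the 'type' column
              .reset_index(name='count')
)
print(summary)

# The same thousands-separator formatting that the tick formatter applies
label = format(1234567, ',')
print(label)
```

For R folks, this is about the same as count() in dplyr followed by renaming the columns.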

Now before we move on, we'll look at using style sheets to customize our graphs nice and quickly.

In [None]:
# plt was already imported in the first cell
print(plt.style.available)

Now you can see that we've got quite a few to play with. I'm going to focus on the following styles:

- fivethirtyeight (because it's my fav website)
- seaborn-notebook
- ggplot
- classic

In [None]:
plt.style.use('fivethirtyeight')
ax = emp_dur_graph('FiveThirtyEight style')

In [None]:
# Note: matplotlib 3.6+ renamed this style to 'seaborn-v0_8-notebook'
plt.style.use('seaborn-notebook')
ax = emp_dur_graph('Seaborn-notebook style')

In [None]:
plt.style.use('ggplot')
ax = emp_dur_graph('ggplot style')

In [None]:
plt.style.use('classic')
ax = emp_dur_graph('classic style')
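
One thing to watch: plt.style.use changes the style for every graph that follows. If you only want a style for a single figure, a possible alternative is the plt.style.context context manager. The sketch below uses a throwaway bar chart with made-up values, and the Agg backend line is only there so it runs outside a notebook (drop it if you're using %matplotlib inline).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line inside a notebook
from matplotlib import pyplot as plt

before = plt.rcParams['axes.facecolor']

# The style applies only inside the with-block
with plt.style.context('ggplot'):
    fig, ax = plt.subplots()
    ax.bar(['a', 'b', 'c'], [3, 2, 1])  # made-up values
    ax.set_title('ggplot style, this figure only')
    inside = plt.rcParams['axes.facecolor']

after = plt.rcParams['axes.facecolor']
plt.close(fig)

print(before, inside, after)
```

Once the block ends, the settings go back to whatever they were before, so you don't need to reset anything by hand.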

Now we want to look at datetimes. Dates can be quite difficult to manipulate but it's worth the effort. Once they're formatted correctly life becomes much easier.

In [None]:
# Parse the issue date, then strip the '-2015' suffix from the text column
issue_d_todate = pd.to_datetime(data.issue_d)
data['issue_d'] = data['issue_d'].str.replace('-2015', '', regex=False)

data.drop(['loan_status'], axis=1, inplace=True)

data.drop(['pymnt_plan', 'url', 'desc', 'title'], axis=1, inplace=True)

# Convert to datetime so the .dt accessor can pull out the year.
# Use bracket assignment: attribute assignment does not create a new column.
data['earliest_cr_line'] = pd.to_datetime(data['earliest_cr_line'])
data['earliest_cr_line_year'] = data['earliest_cr_line'].dt.year
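
Here's a minimal sketch of the datetime conversion on a tiny made-up frame. I'm assuming the dates are stored as text like 'Dec-2011' (month abbreviation, dash, year); if your file uses a different layout, change the format string to match.

```python
import pandas as pd

# Hypothetical values; the 'Mon-YYYY' layout is an assumption about the file
raw = pd.Series(['Dec-2011', 'Jan-2015', 'Aug-1999'])

# An explicit format avoids pandas having to guess the layout
parsed = pd.to_datetime(raw, format='%b-%Y')

df = pd.DataFrame({'earliest_cr_line': parsed})
# The .dt accessor exposes date parts once the column is a real datetime
df['earliest_cr_line_year'] = df['earliest_cr_line'].dt.year
df['earliest_cr_line_month'] = df['earliest_cr_line'].dt.month
print(df)
```

This is the same idea as lubridate's year() and month() in R: parse once, then pull out whatever parts you need.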

Now I'll show you how you can build on the above data frame summaries as well as make some facet graphs.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Series.value_counts replaces the deprecated top-level pd.value_counts
s = data['earliest_cr_line'].value_counts().to_frame().reset_index()
s.columns = ['date', 'count']

s['year'] = s['date'].dt.year
s['month'] = s['date'].dt.month

d = s[s['year'] > 2008]

plt.rcParams.update(plt.rcParamsDefault)
sns.set_style("whitegrid")

g = sns.FacetGrid(d, col="year")
g = g.map(sns.pointplot, "month", "count")
g.set(xlabel = 'Month', ylabel = '')
axes = plt.gca()
# Scale the y-axis to the largest count (not the largest year)
_ = axes.set_ylim([0, d['count'].max()])
plt.tight_layout()
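
If you'd rather build the year/month summary in one go, a groupby does the same job as value_counts plus the extra columns. This is just a sketch on a handful of invented dates, not the real credit line data.

```python
import pandas as pd

# Hypothetical dates standing in for data['earliest_cr_line']
dates = pd.to_datetime(pd.Series([
    '2009-01-01', '2009-01-01', '2009-02-01', '2010-01-01', '2007-05-01',
]))

counts = (
    dates.to_frame('date')
         .assign(year=lambda d: d['date'].dt.year,
                 month=lambda d: d['date'].dt.month)
         .query('year > 2008')          # same filter as the facet graph
         .groupby(['year', 'month'])
         .size()                        # number of rows per (year, month)
         .reset_index(name='count')
)
print(counts)
```

The resulting frame has one row per year and month, which is the shape FacetGrid wants for a col="year" layout.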

Now I want to show you how to easily drop columns that match a given pattern. Let's drop any column that includes "mths" in it.

In [None]:
mths = [s for s in data.columns.values if "mths" in s]
mths

In [None]:
data.drop(mths, axis=1, inplace=True)
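
Another way to do the same thing is with the string methods on the column index, which saves you writing the list comprehension. The frame below is made up (the column names are hypothetical), so it's only a sketch of the pattern.

```python
import pandas as pd

# Hypothetical frame with two columns whose names contain 'mths'
df = pd.DataFrame({
    'loan_amnt': [1000, 2000],
    'mths_since_last_delinq': [5, None],
    'mths_since_last_record': [None, 12],
    'grade': ['A', 'B'],
})

# Boolean mask over the column names; ~ inverts it to keep the non-matches
keep = ~df.columns.str.contains('mths')
trimmed = df.loc[:, keep]
print(trimmed.columns.tolist())
```

It's close to select(-contains("mths")) in dplyr, and it returns a new frame instead of changing the original.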

Things to be covered in future updates:

1. Using groupby statements with all their different calcs
2. Handling missing values, going from just mean replacement all the way to k-means
3. df.describe
4. Using the apply and vectorised functions
5. Converting a dataframe to a numpy/sklearn useful format
6. Running a simple regression model