# Python for R Users
- Author: Sylvia Tran
- Date: September 9, 2019

[GitHub Reference](https://github.com/godsylla/python-for-R-users)

![alt text][python-and-r]

[python-and-r]: https://github.com/godsylla/python-for-R-users/blob/master/assets/python-are-r-friends.png?raw=true "Python"

### Introduction

This notebook is a tutorial on Python for R users. It draws parallels between Python and R in the hope that those new to Python will find it less intimidating and more approachable.

This is an interactive Python notebook (`.ipynb`) in the repository; it runs in Jupyter Notebook (installed via Anaconda, for example) with a Python 3 kernel. The repository also includes a Python script (`python-for-r-users.py`) that contains the code without any of the markdown cells.

The content covered will leverage the following packages:

- [pandas](https://pandas.pydata.org/pandas-docs/stable/) (for data frame shaping, cleaning)
- [numpy](https://docs.scipy.org/doc/numpy/reference/index.html) (for any numeric python required)
- [scikit-learn](https://scikit-learn.org/stable/user_guide.html) (for train-test splitting, feature scaling, modeling, model metrics)

To explore these packages further, refer to their online documentation. All three are widely used in data science and data analysis, so examples are plentiful. Read the documentation carefully, though: the Python implementations sometimes differ in subtle ways from their R counterparts, so the same or a similar task can behave differently than you expect.
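One concrete case of such a difference is the standard deviation. `numpy.std` defaults to the population formula (`ddof=0`), while R's `sd()` and pandas' `.std()` use the sample formula (`ddof=1`). A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]

# numpy defaults to the population standard deviation (ddof=0)
np_default = np.std(values)

# pandas defaults to the sample standard deviation (ddof=1), like R's sd()
pd_default = pd.Series(values).std()

# Passing ddof=1 makes numpy agree with R and pandas
np_sample = np.std(values, ddof=1)

print(np_default, pd_default, np_sample)
```

The first number differs from the other two, which is exactly the kind of quiet mismatch worth checking for when porting an R analysis.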

### Table of Contents:
  1. Importing Packages
  2. Loading Toy Datasets (sklearn)
      - detour
  3. Cursory Inspection (pandas & numpy)
  4. Light Cleaning (base python, pandas)
  5. Train-test-split (sklearn)
  6. Feature Scaling (sklearn)
  7. Model (sklearn)
  8. Model Evaluation (sklearn)

### 1. Importing Packages
![alt text][importing]

[importing]: https://github.com/godsylla/python-for-R-users/blob/master/assets/importing.jpg?raw=true "Importing"

- R: `library('package_name')`
- Python: `import package_name`

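If the package is not installed, you can install it from a terminal with
* `pip install package_name`

To install it from inside a notebook, run one of the following in a code cell. The `!` prefix passes the command to the shell, while the `%pip` magic (available in recent IPython versions) installs into the environment of the running kernel:
* `!pip install package_name`
* `%pip install package_name`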

In [1]:
# The `np` and `pd` nicknames are convention
import numpy as np
import pandas as pd
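
For R users, the main difference is that `library()` attaches a package's exported names to the search path, while `import` keeps them behind the module name or its alias. A minimal sketch of the common import forms, using only standard-library modules so it runs without extra installs:

```python
# Plain import: names stay behind the module name (closest to R's pkg::fun)
import math
root = math.sqrt(16)

# Import with an alias, the same pattern used for `np` and `pd` above
import statistics as stats
avg = stats.mean([1, 2, 3, 4])

# Import a single name directly into the current namespace
from math import floor
whole = floor(2.7)

print(root, avg, whole)
```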

### 2. Loading Toy Datasets

* R: typically found in the `datasets` package
* Python: we'll be using toy datasets from the `sklearn.datasets` package.

Because the dataset lives in a package, we get to apply what we just learned: importing packages.

**At the risk of having this kind of effect on the audience... I shall proceed anyway**

![](https://media.giphy.com/media/ekvjSltNbJFtRG4V22/giphy.gif)

But hey! It's an INTRO, so you all have to suffer this with me

In [2]:
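# Note: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run as written
from sklearn.datasets import load_boston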

boston = load_boston()
print('boston: ', type(boston))            # R: print(class(boston))

# To access the data for a sklearn toy dataset, we need to add `.data` 
# Simultaneously convert this to a pandas DataFrame
boston_df = pd.DataFrame(boston.data)
print('boston_df: ', type(boston_df))

boston:  <class 'sklearn.utils.Bunch'>
boston_df:  <class 'pandas.core.frame.DataFrame'>
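
On scikit-learn 1.2 and later, where `load_boston` is unavailable, the same Bunch-to-DataFrame pattern can be sketched with `load_diabetes`, another bundled toy dataset. The variable names are illustrative, and the cells that follow still refer to the Boston data:

```python
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

# .data holds the feature matrix, .feature_names the column labels
diabetes_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# .target holds the values a regression model would predict
diabetes_y = diabetes.target

print(type(diabetes))
print(diabetes_df.shape)
```

Passing `columns=` at construction time avoids the separate renaming step used later for the Boston frame.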


In [3]:
# With toy datasets, you can print the description stored on the sklearn.utils.Bunch object
print(boston.DESCR)



In [4]:
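# Get a sense of the types & null count
# Note: .info() prints its report and returns None, so wrapping it in display()
# also shows `None` below; calling boston_df.info() on its own is enough
display(boston_df.info())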

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
0     506 non-null float64
1     506 non-null float64
2     506 non-null float64
3     506 non-null float64
4     506 non-null float64
5     506 non-null float64
6     506 non-null float64
7     506 non-null float64
8     506 non-null float64
9     506 non-null float64
10    506 non-null float64
11    506 non-null float64
12    506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB


None

In [5]:
# Inspect the dataframe 
display(boston_df.shape)

# Show the first 5 rows (.tail() shows the last 5)
# .sample() randomly selects 1 row
# Each accepts the number of rows to return, e.g. .head(10)
display(boston_df.head())

# Rename the columns using the feature_names:
# zip is a built-in Python function, make it your friend
boston_df.rename(columns=dict(zip(boston_df.columns, boston.feature_names)), inplace=True)
display(boston_df.head())

# Get the target for your regression model
boston_y = boston.target

(506, 13)

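|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |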


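|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |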
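
The `zip` call in the renaming step pairs each current column label with its new name, and `dict` turns those pairs into the mapping that `rename` expects. A small sketch with a made-up two-column frame (the names `toy_df` and `new_names` are illustrative):

```python
import pandas as pd

toy_df = pd.DataFrame([[1, 2], [3, 4]])      # columns default to 0, 1
new_names = ['left', 'right']

# zip pairs items by position: (0, 'left'), (1, 'right')
mapping = dict(zip(toy_df.columns, new_names))

# rename returns a new frame unless inplace=True is passed
renamed = toy_df.rename(columns=mapping)

# Assigning to .columns is the closest analogue of R's names(df) <- new_names
toy_df.columns = new_names

print(mapping)
print(list(renamed.columns))
```

Both routes end with the same labels; the mapping form is handy when only some of the columns need new names.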


In [6]:
# Get summary statistics on the dataframe
print(boston_df.describe())

             CRIM          ZN       INDUS        CHAS         NOX          RM  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean     3.593761   11.363636   11.136779    0.069170    0.554695    6.284634   
std      8.596783   23.322453    6.860353    0.253994    0.115878    0.702617   
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000   
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   
75%      3.647423   12.500000   18.100000    0.000000    0.624000    6.623500   
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000   

              AGE         DIS         RAD         TAX     PTRATIO           B  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean    68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   
std     28.148861    2.1057
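
`describe()` plays roughly the role of R's `summary()`. Its result is itself a DataFrame, so individual statistics can be pulled out by label, and the `percentiles` argument changes which quantiles are reported. A sketch on a made-up column:

```python
import pandas as pd

toy_df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})

# The summary is a DataFrame indexed by statistic name
summary = toy_df.describe()
mean_x = summary.loc['mean', 'x']
median_x = summary.loc['50%', 'x']

# Request the 10th and 90th percentiles in addition to the usual statistics
tails = toy_df.describe(percentiles=[0.1, 0.9])

print(summary)
print(list(tails.index))
```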

### We're going to take a detour because there are slightly more interesting data sets with which to demo
![alt text][detour]

[detour]: https://github.com/godsylla/python-for-R-users/blob/master/assets/detour.jpg?raw=true "Detour"


#### More specifically, we're going to use a slightly more popular dataset

![](https://media.giphy.com/media/PqMTE3jpTKxXi/giphy.gif)

### 2. Read in the data using `pandas`

#### B E E R
![](https://media.giphy.com/media/3og0IUU8wsktr7quoU/giphy.gif)

**About the data:** 
https://www.kaggle.com/nickhould/craft-cans

| Type (A = alpha, # = numeric) | Column Name | Comment |
| ------------- |:-------------:| -----:|
| # | abv | Alcohol by volume, from 0 (no alcohol) to 1 (pure alcohol) |
| # | ibu | International bittering units, which describe how bitter a drink is |
| # | id | Unique ID |
| A | name | Name of the beer |
| A | style | Beer style (lager, ale, IPA, etc.) |
| # | brewery_id | Unique identifier for the brewery that produces this beer; can be used to join with brewery info |
| # | ounces | Size of the beer in ounces |


In [7]:
# READ IN THE DATA

# Unzip the zip file
from zipfile import ZipFile

# Point to where the zip file is saved and open it in read mode ('r')
with ZipFile('../data/craft-cans.zip', 'r') as zipObject:
    # Specify the directory where the unzipped files should go
    zipObject.extractall('../data/')

# Use pandas to read in the csv file     # R: read.csv('../data/beers.csv')
beers_df = pd.read_csv('../data/beers.csv')
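
Extracting to disk is not strictly required: `ZipFile.open` returns a file-like object that `pd.read_csv` can read directly. The sketch below builds a tiny archive in memory so it runs without the Kaggle download; the two rows and the archive contents are made up for illustration.

```python
import io
from zipfile import ZipFile

import pandas as pd

# Build a small zip archive in memory (a stand-in for craft-cans.zip)
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('beers.csv', 'abv,name\n0.05,Toy Lager\n0.07,Toy IPA\n')

# Read one member straight from the archive, without calling extractall
buffer.seek(0)
with ZipFile(buffer, 'r') as archive:
    with archive.open('beers.csv') as member:
        toy_beers = pd.read_csv(member)

print(toy_beers)
```

With a real file on disk, the `buffer` would simply be replaced by the path to the archive.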

### 3. Cursory Data Inspection

**Let's begin!**
![](https://media.giphy.com/media/3ohs4ruO9hBMDRbOne/giphy.gif)

In [11]:
# CURSORY INSPECTION USING BUILT-IN PANDAS METHODS ON THE DATAFRAME OBJECT

# .head() defaults to the first 5 rows; pass a number to change that
# .tail() does the same for the last rows
# .sample() returns 1 random row unless a different number is specified
display(beers_df.head())

# null value count & types
# Note: .info() prints its report and returns None, hence the `None` in the output
display(beers_df.info())

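|  | Unnamed: 0 | abv | ibu | id | name | style | brewery_id | ounces |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0.05 | NaN | 1436 | Pub Beer | American Pale Lager | 408 | 12.0 |
| 1 | 1 | 0.066 | NaN | 2265 | Devil's Cup | American Pale Ale (APA) | 177 | 12.0 |
| 2 | 2 | 0.071 | NaN | 2264 | Rise of the Phoenix | American IPA | 177 | 12.0 |
| 3 | 3 | 0.09 | NaN | 2263 | Sinister | American Double / Imperial IPA | 177 | 12.0 |
| 4 | 4 | 0.075 | NaN | 2262 | Sex and Candy | American IPA | 177 | 12.0 |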


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2410 entries, 0 to 2409
Data columns (total 8 columns):
Unnamed: 0    2410 non-null int64
abv           2348 non-null float64
ibu           1405 non-null float64
id            2410 non-null int64
name          2410 non-null object
style         2405 non-null object
brewery_id    2410 non-null int64
ounces        2410 non-null float64
dtypes: float64(3), int64(3), object(2)
memory usage: 150.7+ KB


None
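
For orientation, the inspection methods used so far map onto familiar R functions. A sketch on a made-up frame (`toy_df` is illustrative), with the rough R equivalent noted next to each call:

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'abv': [0.05, 0.066, np.nan, 0.09],
    'style': ['Lager', 'APA', 'IPA', None],
})

n_rows, n_cols = toy_df.shape        # R: dim(toy_df)
first_two = toy_df.head(2)           # R: head(toy_df, 2)
last_row = toy_df.tail(1)            # R: tail(toy_df, 1)
col_types = toy_df.dtypes            # R: sapply(toy_df, class)

# .info() prints a str()-like report and returns None
result = toy_df.info()               # R: str(toy_df)

print(n_rows, n_cols, result)
```

Note that `shape` and `dtypes` are attributes, so they take no parentheses, while `head`, `tail`, and `info` are methods.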

In [10]:
# summary stats for numeric columns only
display(beers_df.describe())

|  | Unnamed: 0 | abv | ibu | id | brewery_id | ounces |
| --- | --- | --- | --- | --- | --- | --- |
| count | 2410.0 | 2348.0 | 1405.0 | 2410.0 | 2410.0 | 2410.0 |
| mean | 1204.5 | 0.059773 | 42.713167 | 1431.113278 | 231.749793 | 13.592241 |
| std | 695.851397 | 0.013542 | 25.954066 | 752.459975 | 157.685604 | 2.352204 |
| min | 0.0 | 0.001 | 4.0 | 1.0 | 0.0 | 8.4 |
| 25% | 602.25 | 0.05 | 21.0 | 808.25 | 93.0 | 12.0 |
| 50% | 1204.5 | 0.056 | 35.0 | 1453.5 | 205.0 | 12.0 |
| 75% | 1806.75 | 0.067 | 64.0 | 2075.75 | 366.0 | 16.0 |
| max | 2409.0 | 0.128 | 138.0 | 2692.0 | 557.0 | 32.0 |
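
`describe()` skips text columns by default, and the differing counts hint at missing values. A sketch of two follow-up checks on a made-up frame: `isnull().sum()` counts missing values per column (roughly R's `colSums(is.na(df))`), and calling `describe()` on a text column reports its count, number of unique values, most frequent value, and that value's frequency.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'abv': [0.05, np.nan, 0.071, 0.09],
    'ibu': [np.nan, np.nan, 60.0, 35.0],
    'style': ['American IPA', 'American IPA', None, 'American Pale Lager'],
})

# Missing values per column
null_counts = toy_df.isnull().sum()

# Summary of a text column: count, unique, top, freq
style_summary = toy_df['style'].describe()

print(null_counts)
print(style_summary)
```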


### 4. Light Data Cleaning
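
The inspection output points to two obvious cleanup candidates: the `Unnamed: 0` column, which merely repeats the row index, and the missing values in `abv`, `ibu`, and `style`. A minimal sketch of how that cleaning might look, run on a made-up stand-in for `beers_df` (the rows, and the choice to drop rather than impute, are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with the same kinds of problems as beers_df
toy_beers = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2, 3],
    'abv': [0.05, np.nan, 0.071, 0.09],
    'ibu': [np.nan, 40.0, 60.0, 35.0],
    'style': ['American Pale Lager', 'American IPA', None, 'American IPA'],
})

# Drop the column that only repeats the index      (R: df$col <- NULL)
cleaned = toy_beers.drop(columns=['Unnamed: 0'])

# Keep only rows where abv and style are present   (R: complete.cases on those columns)
cleaned = cleaned.dropna(subset=['abv', 'style'])

# Tidy the text column with pandas string methods
cleaned['style'] = cleaned['style'].str.strip().str.lower()

print(cleaned)
```

Whether to drop or fill missing `ibu` values depends on the model that follows, so that column is left untouched here.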


#### Additional Resources can be found here:

- [Numpy Tutorial](http://cs231n.github.io/python-numpy-tutorial/)
- [Scikit Learn - More Documentation](https://scikit-learn.org/stable/index.html)