# Python for R Users
- Author: Sylvia Tran
- Date: September 9, 2019

[GitHub Reference](https://github.com/godsylla/python-for-R-users)

![alt text][python-and-r]

[python-and-r]: https://github.com/godsylla/python-for-R-users/blob/master/assets/python-are-r-friends.png?raw=true "Python"

### Introduction

This notebook is a tutorial on Python for R users. It draws out similarities between Python and R in the hope that those new to Python will find it less intimidating and more approachable.

This is an interactive Python notebook (.ipynb) in the repository. It can be run in Jupyter Notebook (for example, via Anaconda) with a Python 3 kernel. There is also an accompanying Python script (`python-for-r-users.py`) that contains the code without any of the markdown cells.

The content covered will leverage the following packages:

- [pandas](https://pandas.pydata.org/pandas-docs/stable/) (for shaping and cleaning data frames)
- [numpy](https://docs.scipy.org/doc/numpy/reference/index.html) (for numerical computing)
- [scikit-learn](https://scikit-learn.org/stable/user_guide.html) (for train-test splitting, feature scaling, modeling, and model metrics)

To explore these packages further, refer to their online documentation. All three are widely used in data science and data analysis, so examples are plentiful. When using numpy, pandas, or scikit-learn, read the documentation carefully: defaults and implementation details sometimes differ from R's, so the same or a similar task can behave differently than you expect.
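One concrete instance of such a difference, sketched under the assumption that numpy is installed: `np.std` divides by `n` by default, while R's `sd()` divides by `n - 1`. The data below is made up for illustration.

```python
import statistics

import numpy as np

# Made-up values chosen so the arithmetic is easy to check by hand
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = np.std(data)        # numpy default: ddof=0, divides by n
sample_sd = np.std(data, ddof=1)    # ddof=1 divides by n - 1, like R's sd()

print(population_sd)
print(sample_sd)
print(statistics.stdev(data))       # standard-library n - 1 version, for comparison
```

Passing `ddof=1` is the usual way to reproduce R's result with numpy.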

### Table of Contents:
  1. Importing Packages
  2. Loading Toy Datasets (sklearn)
      - detour
  3. Cursory Inspection (pandas & numpy)
  4. Light Cleaning (base python, pandas)
  5. Train-test-split (sklearn)
  6. Feature Scaling (sklearn)
  7. Model (sklearn)
  8. Model Evaluation (sklearn)

#### 1. Importing Packages
![alt text][importing]

[importing]: https://github.com/godsylla/python-for-R-users/blob/master/assets/importing.jpg?raw=true "Importing"

- R: `library('package_name')`
- Python: `import package_name`

If the package is not installed, install it from a terminal with:
* `pip install package_name`


To install it from inside the notebook instead, run this in a code cell (the leading `!` hands the command to the shell):
* `!pip install package_name`
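Before installing, it can help to check whether a package is already importable from the running kernel. The sketch below uses only the standard library. The helper name `is_installed` and the fake package name are made up for this illustration.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


print(sys.executable)                                  # the Python interpreter the kernel runs on
print(is_installed("json"))                            # standard-library module
print(is_installed("surely_not_a_real_package_xyz"))   # hypothetical missing package
```

If `pip` and the kernel point at different Python installations, a freshly installed package can still fail to import. In recent IPython versions, `%pip install package_name` is meant to install into the kernel's own environment.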

In [1]:
# The `np` and `pd` nicknames are convention
import numpy as np
import pandas as pd
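
The three import forms below cover most of what R users reach for with `library()`. This sketch uses the standard-library `math` module so that nothing needs installing. The alias `m` is arbitrary and chosen only for this example.

```python
import math              # names stay behind the `math.` prefix, similar to R's math::sqrt style
import math as m         # same module under a shorter alias, like `np` and `pd`
from math import sqrt    # pull a single name into the current namespace

a = math.sqrt(16)
b = m.sqrt(16)
c = sqrt(16)

print(a, b, c)
```

Unlike `library()`, a plain `import` does not attach every function to the global namespace, so calls read more like R's `package::function()`.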

#### 2. Loading Toy Datasets

* R: typically found in the `datasets` package
* Python: we'll be using toy datasets from the `sklearn.datasets` module.

Since the dataset comes from a package, we'll put what we just learned (importing packages) into practice.

**At the risk of having this kind of effect on the audience... I shall proceed anyway**

![](https://media.giphy.com/media/ekvjSltNbJFtRG4V22/giphy.gif)

But hey! It's an INTRO, so you all have to suffer this with me

In [2]:
# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run as written.
from sklearn.datasets import load_boston

boston = load_boston()
print('boston: ', type(boston))            # R: print(class(boston))

# The feature matrix of a sklearn toy dataset lives in its `.data` attribute.
# Wrapping it in pd.DataFrame() converts it to a pandas DataFrame.
boston_df = pd.DataFrame(boston.data)
print('boston_df: ', type(boston_df))

boston:  <class 'sklearn.utils.Bunch'>
boston_df:  <class 'pandas.core.frame.DataFrame'>
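
A `Bunch` is a dictionary whose keys can also be read as attributes, which is why `boston.data` works; it is loosely comparable to reading a named list with `$` in R. The class below is a hypothetical stand-in written for this illustration, not scikit-learn's actual implementation, and the toy values are made up.

```python
class MiniBunch(dict):
    """Hypothetical stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails, so dict methods still work
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


toy = MiniBunch(
    data=[[1.0, 2.0], [3.0, 4.0]],
    target=[10.0, 20.0],
    feature_names=["a", "b"],
)

print(type(toy))                  # R: class(toy)
print(isinstance(toy, dict))      # the object is still a dictionary
print(list(toy.keys()))           # R: names(toy)
print(toy.data is toy["data"])    # attribute access and key access reach the same object
```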


In [3]:
# Toy datasets carry a text description on the sklearn.utils.Bunch object we initially loaded.
# print() renders its line breaks; a bare `boston.DESCR` shows the raw string with \n characters.
print(boston.DESCR)



In [4]:
display(boston.feature_names)

# Indexing and slicing in Python use square brackets.
# INDEXING BEGINS WITH 0, NOT 1, so [:10] returns positions 0 through 9
display(boston.target[:10])

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9])
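
The zero-based, end-exclusive slicing rules are easiest to see next to their R counterparts. A small sketch with a plain Python list holding the ten target values printed above:

```python
values = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]

first = values[0]           # R: values[1]
first_three = values[:3]    # R: values[1:3]; the stop position 3 is excluded
middle = values[2:5]        # positions 2, 3, 4 -> R: values[3:5]
last = values[-1]           # last element; in R, values[-1] drops the first element instead

print(first)
print(first_three)
print(middle)
print(last)
```

The negative index is the one most likely to surprise R users, since it selects from the end rather than excluding an element.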

In [5]:
boston_df.shape

(506, 13)

#### 3. Cursory Data Inspection

**Let's begin!**
![](https://media.giphy.com/media/3ohs4ruO9hBMDRbOne/giphy.gif)


**NOTE**
- Since `boston_df` is a `pandas.core.frame.DataFrame` object, most of what follows demonstrates built-in `pandas` methods and attributes.
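
As a quick orientation, here is a rough mapping from common R inspection calls to their pandas counterparts. This is a sketch on a made-up two-column frame (`toy_df` is not part of the Boston data), and the R equivalents in the comments are approximate.

```python
import pandas as pd

# Made-up toy data for illustration only
toy_df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 30.0, 40.0]})

print(toy_df.shape)         # R: dim(toy_df)           (attribute, no parentheses)
print(toy_df.head(2))       # R: head(toy_df, 2)
print(toy_df.tail(2))       # R: tail(toy_df, 2)
print(toy_df.dtypes)        # R: sapply(toy_df, class)
print(toy_df.describe())    # R: summary(toy_df)
print(toy_df["x"].mean())   # R: mean(toy_df$x)
```

Note the difference between attributes such as `.shape` and methods such as `.head()`: calling `toy_df.shape()` raises an error because a tuple is not callable.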

In [6]:
# Inspect the dimensions of the dataframe: (rows, columns)
display(boston_df.shape)

# Show the first 5 rows (.tail() shows the last 5,
# .sample() returns 1 randomly selected row).
# Each also accepts a number of rows between the parentheses, e.g. .head(10)
display(boston_df.head())

# Rename the columns using the feature_names:
# zip is a built-in Python function that pairs up the elements of two sequences; make it your friend
boston_df.rename(columns=dict(zip(boston_df.columns, boston.feature_names)), inplace=True)
display(boston_df.head())

(506, 13)

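|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |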


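|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |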
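
The rename step relies on `dict(zip(...))` to build an old-name to new-name mapping. A minimal sketch of what that expression produces, using only the first three column names so the output stays short:

```python
old_names = [0, 1, 2]                  # default integer column labels
new_names = ["CRIM", "ZN", "INDUS"]    # first three feature names

pairs = list(zip(old_names, new_names))     # zip pairs elements position by position
mapping = dict(zip(old_names, new_names))   # dict turns the pairs into key -> value

print(pairs)
print(mapping)
```

A dictionary of this shape is what `rename(columns=...)` receives. In R, `setNames(new_names, old_names)` builds a roughly comparable named vector.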


In [8]:
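# Get a sense of the types & null count.
# Note: .info() prints its report directly and returns None,
# so wrapping it in display() also shows `None` after the report.
display(boston_df.info())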

# Get summary statistics on the dataframe
print(boston_df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB


None

             CRIM          ZN       INDUS        CHAS         NOX          RM  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean     3.593761   11.363636   11.136779    0.069170    0.554695    6.284634   
std      8.596783   23.322453    6.860353    0.253994    0.115878    0.702617   
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000   
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   
75%      3.647423   12.500000   18.100000    0.000000    0.624000    6.623500   
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000   

              AGE         DIS         RAD         TAX     PTRATIO           B  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000   
mean    68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   
std     28.148861    2.1057

### We're going to take a detour
![alt text][detour]

[det]: https://github.com/godsylla/python-for-R-users/blob/master/assets/importing.jpg?raw=true "Importing"




#### Additional Resources can be found here:

- [Numpy Tutorial](http://cs231n.github.io/python-numpy-tutorial/)
- [Scikit Learn - More Documentation](https://scikit-learn.org/stable/index.html)