# Exploratory Data Analysis: R and Python

## Libraries

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns 

In [2]:
# Load the rpy2 extension so that R cells can run in this notebook
%load_ext rpy2.ipython

# Suppress all warnings to keep the output readable
import warnings
warnings.filterwarnings('ignore')

In [3]:
%%R 

library(tidyverse)

## Importing Data

First, check the current working directory in both Python and R.

In [None]:
%pwd

In [None]:
%%R
getwd()

Importing the file with pandas.

In [6]:
filename = "datasets/iris.csv"
df_p = pd.read_csv(filename)

type(df_p)

pandas.core.frame.DataFrame
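
As a minimal sketch of what `pd.read_csv` does with this file, the snippet below parses a three-row excerpt from an in-memory string instead of from disk. The rows are copied from the first rows shown later in this notebook; the names `csv_text` and `sample` are illustrative and not part of the original analysis.

```python
import io

import pandas as pd

# Three-row excerpt standing in for datasets/iris.csv (illustrative only)
csv_text = """sepal.length,sepal.width,petal.length,petal.width,variety
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
4.7,3.2,1.3,0.2,Setosa
"""

# read_csv accepts a file-like object as well as a path
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))   # the class of the returned object
print(sample.dtypes)  # column types inferred by the parser
```

Reading from a string like this can be handy for quickly checking how column types are inferred before loading the full file.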

Importing the file with readr.

In [7]:
%%R

filename <- "datasets/iris.csv"
df_r <- read_csv(filename) 

class(df_r)

[1] "tbl_df"     "tbl"        "data.frame"


We can also pass a data frame from Python to an R cell.

In [8]:
# Take df_p and assign it to an R variable of the same name
%R -i df_p

In [9]:
%%R

class(df_p)

[1] "data.frame"


## Exploring the Data

### Dimension, size, column names

In [10]:
df_p.shape

(150, 5)

In [11]:
df_p.columns

Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')

In [12]:
%%R
dim(df_r)

[1] 150   5


In [13]:
%%R
names(df_r)

[1] "sepal.length" "sepal.width"  "petal.length" "petal.width"  "variety"     
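
For readers coming from R, the sketch below shows one way to get the counterparts of `nrow()`, `ncol()` and `names()` as plain Python values. It uses a two-row stand-in frame built from the first rows of the dataset, so the variable names (`df`, `n_rows`, `n_cols`, `col_names`) are assumptions for illustration only.

```python
import pandas as pd

# Two rows from the dataset; a stand-in for the full data frame
df = pd.DataFrame(
    {
        "sepal.length": [5.1, 4.9],
        "sepal.width": [3.5, 3.0],
        "petal.length": [1.4, 1.4],
        "petal.width": [0.2, 0.2],
        "variety": ["Setosa", "Setosa"],
    }
)

n_rows, n_cols = df.shape        # counterpart of R's nrow() and ncol()
col_names = df.columns.tolist()  # counterpart of R's names()

print(n_rows, n_cols)
print(col_names)
```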


### Head, tail

In [15]:
df_p.head(2)

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa


In [16]:
%%R

head(df_r)

# A tibble: 6 x 5
  sepal.length sepal.width petal.length petal.width variety
         <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1          5.1         3.5          1.4         0.2 Setosa 
2          4.9         3            1.4         0.2 Setosa 
3          4.7         3.2          1.3         0.2 Setosa 
4          4.6         3.1          1.5         0.2 Setosa 
5          5           3.6          1.4         0.2 Setosa 
6          5.4         3.9          1.7         0.4 Setosa 


Let's check what the data frame that we passed from Python looks like in R.

In [17]:
%%R

head(df_p)

  sepal.length sepal.width petal.length petal.width variety
0          5.1         3.5          1.4         0.2  Setosa
1          4.9         3.0          1.4         0.2  Setosa
2          4.7         3.2          1.3         0.2  Setosa
3          4.6         3.1          1.5         0.2  Setosa
4          5.0         3.6          1.4         0.2  Setosa
5          5.4         3.9          1.7         0.4  Setosa
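
The row labels 0 to 5 above suggest that the zero-based pandas index is carried over as R row names. If 1-based labels are preferred on the R side, one option is to shift the index on a copy before passing it with `%R -i`. This is a sketch under that assumption about the conversion; `df_one_based` is a hypothetical name and the three-row frame is a stand-in for `df_p`.

```python
import pandas as pd

# Three-row stand-in for df_p (illustrative only)
df = pd.DataFrame(
    {
        "sepal.length": [5.1, 4.9, 4.7],
        "variety": ["Setosa", "Setosa", "Setosa"],
    }
)

# Work on a copy so the original zero-based index is left untouched
df_one_based = df.copy()
df_one_based.index = pd.RangeIndex(start=1, stop=len(df_one_based) + 1)

print(df_one_based)
```

Working on a copy keeps positional label lookups on the Python side unchanged while the shifted frame is the one handed to R.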


### Summary Statistics

In [19]:
df_p.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal.length    150 non-null float64
sepal.width     150 non-null float64
petal.length    150 non-null float64
petal.width     150 non-null float64
variety         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
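
`info()` prints its report rather than returning it. When the non-null counts or column types are needed as values, for example in a later check, something like the following may help. It is a sketch on a three-row stand-in frame; the names `missing_per_column` and `column_types` are illustrative.

```python
import pandas as pd

# Three-row stand-in for df_p (illustrative only)
df = pd.DataFrame(
    {
        "sepal.length": [5.1, 4.9, 4.7],
        "sepal.width": [3.5, 3.0, 3.2],
        "variety": ["Setosa", "Setosa", "Setosa"],
    }
)

missing_per_column = df.isna().sum()  # number of missing values per column
column_types = df.dtypes              # dtype of each column

print(missing_per_column)
print(column_types)
```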


In [20]:
%%R
str(df_r)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	150 obs. of  5 variables:
 $ sepal.length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ sepal.width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ petal.length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ petal.width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ variety     : chr  "Setosa" "Setosa" "Setosa" "Setosa" ...
 - attr(*, "spec")=List of 2
  ..$ cols   :List of 5
  .. ..$ sepal.length: list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ sepal.width : list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ petal.length: list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ petal.width : list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ variety     : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  ..$ default: list()
  .. ..- attr(*, "class")= chr  "collector_guess" "col

In [21]:
df_p.describe()

       sepal.length  sepal.width  petal.length  petal.width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000


In [22]:
%%R
summary(df_r)

  sepal.length    sepal.width     petal.length    petal.width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
   variety         
 Length:150        
 Class :character  
 Mode  :character  
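
R's `summary()` also reports on the character column `variety`, while `describe()` covers only numeric columns by default. A possible way to bring the text column into the pandas summary is `describe(include="all")`, optionally paired with `value_counts()`. The sketch below uses the six rows from the `head()` output as a stand-in, so all of them are Setosa; `full_summary` and `variety_counts` are illustrative names.

```python
import pandas as pd

# Six rows taken from the head() output; a stand-in for the full dataset
df = pd.DataFrame(
    {
        "sepal.length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
        "petal.width": [0.2, 0.2, 0.2, 0.2, 0.2, 0.4],
        "variety": ["Setosa"] * 6,
    }
)

# include="all" adds unique/top/freq rows for non-numeric columns
full_summary = df.describe(include="all")

# Frequency of each category in the text column
variety_counts = df["variety"].value_counts()

print(full_summary)
print(variety_counts)
```

In the combined table, statistics that do not apply to a column (such as `mean` for `variety`) show up as `NaN`.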
                   
                   
                   


To be continued...