## Google Data Analyst - [Course 7 - Data Analysis with R Programming](https://www.coursera.org/learn/data-analysis-r/supplement/Y0Vr4/course-syllabus) [[Data Analyst]] #Google-data-analyst-course

### [Week 3 - Data Cleaning](https://www.coursera.org/learn/data-analysis-r/lecture/3FBCt/cleaning-up-with-the-basics)

**[Pandas Comparison with R](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html)**

R needs the extra packages `here`, `skimr`, `janitor` and `dplyr` for basic data-cleaning tasks:

```
install.packages("here") # the "here" package makes referencing files easier (?!)
library("here") # load package

install.packages("skimr") # makes summarizing data easy and lets you skim through it more quickly
library("skimr")

install.packages("janitor") # has functions for cleaning data
library("janitor")

install.packages("dplyr") # data manipulation verbs such as select() and rename()
library("dplyr")
```

### Load the Palmer Penguins dataset
```
install.packages("palmerpenguins")
library("palmerpenguins")
```
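Then, we can check out the data with:
```
skim_without_charts(penguins) # detailed summary of every column
glimpse(penguins)             # column names, types and first values
head(penguins)                # first six rows
select(penguins, species)     # only the chosen column(s)
```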


In [1]:
# Load penguins dataset in Python
import pandas as pd
import numpy as np

url = "https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv"

In [2]:
df = pd.read_csv(url)

df.head()

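| | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |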


The `skim_without_charts()` function returns a lot of information about the dataset, including:
- name
- number of rows and columns
- column types and the frequency of each type
- for each variable type:
  - n_missing
  - complete_rate
  - ordered, n_unique, mean and other statistics

In pandas, `info()`, `dtypes`, `describe()` and `select_dtypes()` together give a similar overview.


In [3]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rowid              344 non-null    int64  
 1   species            344 non-null    object 
 2   island             344 non-null    object 
 3   bill_length_mm     342 non-null    float64
 4   bill_depth_mm      342 non-null    float64
 5   flipper_length_mm  342 non-null    float64
 6   body_mass_g        342 non-null    float64
 7   sex                333 non-null    object 
 8   year               344 non-null    int64  
dtypes: float64(4), int64(2), object(3)
memory usage: 24.3+ KB


In [4]:
df.dtypes

rowid                  int64
species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
year                   int64
dtype: object

In [5]:
df.describe()

| | rowid | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
|---|---|---|---|---|---|---|
| count | 344.0 | 342.0 | 342.0 | 342.0 | 342.0 | 344.0 |
| mean | 172.5 | 43.92193 | 17.15117 | 200.915205 | 4201.754386 | 2008.02907 |
| std | 99.448479 | 5.459584 | 1.974793 | 14.061714 | 801.954536 | 0.818356 |
| min | 1.0 | 32.1 | 13.1 | 172.0 | 2700.0 | 2007.0 |
| 25% | 86.75 | 39.225 | 15.6 | 190.0 | 3550.0 | 2007.0 |
| 50% | 172.5 | 44.45 | 17.3 | 197.0 | 4050.0 | 2008.0 |
| 75% | 258.25 | 48.5 | 18.7 | 213.0 | 4750.0 | 2009.0 |
| max | 344.0 | 59.6 | 21.5 | 231.0 | 6300.0 | 2009.0 |


In [6]:
df.select_dtypes('object')

| | species | island | sex |
|---|---|---|---|
| 0 | Adelie | Torgersen | male |
| 1 | Adelie | Torgersen | female |
| 2 | Adelie | Torgersen | female |
| 3 | Adelie | Torgersen | NaN |
| 4 | Adelie | Torgersen | female |
| ... | ... | ... | ... |
| 339 | Chinstrap | Dream | male |
| 340 | Chinstrap | Dream | female |
| 341 | Chinstrap | Dream | male |
| 342 | Chinstrap | Dream | male |


In [7]:
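df.select_dtypes(include='float64', exclude='int64')  # the second positional argument is `exclude`; pass include=['float64', 'int64'] to keep both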

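| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 | 3750.0 |
| 1 | 39.5 | 17.4 | 186.0 | 3800.0 |
| 2 | 40.3 | 18.0 | 195.0 | 3250.0 |
| 3 | NaN | NaN | NaN | NaN |
| 4 | 36.7 | 19.3 | 193.0 | 3450.0 |
| ... | ... | ... | ... | ... |
| 339 | 55.8 | 19.8 | 207.0 | 4000.0 |
| 340 | 43.5 | 18.1 | 202.0 | 3400.0 |
| 341 | 49.6 | 18.2 | 193.0 | 3775.0 |
| 342 | 50.8 | 19.0 | 210.0 | 4100.0 |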
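
None of the pandas calls above reports `n_missing` or `complete_rate` directly the way `skim_without_charts()` does. The sketch below is one way to approximate those fields, plus `n_unique`, with one row per column. It is not a full `skimr` replacement, and the small DataFrame is typed in by hand from the first five rows shown earlier (a subset of the columns) so that the example runs offline.

```python
import numpy as np
import pandas as pd

# First five rows of the penguins data, copied from the head() output above.
df = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
    "sex": ["male", "female", "female", None, "female"],
    "year": [2007] * 5,
})

# One row per column, roughly mirroring skimr's per-variable fields.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),     # column type as text
    "n_missing": df.isna().sum(),       # count of missing values
    "complete_rate": df.notna().mean(), # share of non-missing values
    "n_unique": df.nunique(),           # distinct values, ignoring missing ones
})
print(summary)
```

On the full dataset the same four expressions apply unchanged; only the hand-typed `df` would be replaced by the DataFrame loaded with `pd.read_csv(url)`.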


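In R, `glimpse(penguins)` prints every column with its type and first few values, while `head(penguins)` returns the first six rows. In pandas, `head()` returns the first five rows by default.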

In [8]:
df.head()

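| | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |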
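
pandas has no direct counterpart to `glimpse()`. Transposing `df.head()` gets close, and a small helper can add the dtypes. The `glimpse` function below is a hypothetical helper written for these notes, not part of pandas or dplyr, and its output format only loosely imitates the R version. The DataFrame is again typed in from the rows shown above.

```python
import numpy as np
import pandas as pd

# A few columns from the first five penguin rows shown above.
df = pd.DataFrame({
    "species": ["Adelie"] * 5,
    "island": ["Torgersen"] * 5,
    "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
    "year": [2007] * 5,
})


def glimpse(frame: pd.DataFrame, n_values: int = 3) -> str:
    """Hypothetical helper: one line per column with its dtype and first values."""
    lines = [f"Rows: {len(frame)}", f"Columns: {frame.shape[1]}"]
    for name in frame.columns:
        # Join the first n_values entries of the column into a short preview.
        preview = ", ".join(str(value) for value in frame[name].head(n_values))
        lines.append(f"{name} <{frame[name].dtype}> {preview}")
    return "\n".join(lines)


print(glimpse(df))
```

For a quick look without a helper, `df.head().T` shows the same first rows with columns listed vertically.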


In R, we use `select()` to keep or drop specific columns:
```
penguins %>%
  select(-species)
```
The minus sign excludes a column, so this returns every column except `species`.
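
The closest pandas counterpart to `select(-species)` is `DataFrame.drop(columns=...)`. The sketch below uses a three-row DataFrame typed in from the data shown earlier; the variable name `without_species` is only for illustration.

```python
import pandas as pd

# Three rows and three columns copied from the penguins data shown earlier.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "island": ["Torgersen", "Torgersen", "Torgersen"],
    "bill_length_mm": [39.1, 39.5, 40.3],
})

# Similar in spirit to `penguins %>% select(-species)`.
without_species = df.drop(columns="species")
print(without_species.columns.tolist())

# drop() returns a new DataFrame, so the original still has the column.
print("species" in df.columns)
```

Passing a list, such as `df.drop(columns=["species", "island"])`, removes several columns at once.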

In [9]:
print(df['species'])  # single brackets return a Series; use df[['species']] to get a DataFrame


0         Adelie
1         Adelie
2         Adelie
3         Adelie
4         Adelie
         ...    
339    Chinstrap
340    Chinstrap
341    Chinstrap
342    Chinstrap
343    Chinstrap
Name: species, Length: 344, dtype: object


In [10]:
df[['species', 'island', 'sex']]

| | species | island | sex |
|---|---|---|---|
| 0 | Adelie | Torgersen | male |
| 1 | Adelie | Torgersen | female |
| 2 | Adelie | Torgersen | female |
| 3 | Adelie | Torgersen | NaN |
| 4 | Adelie | Torgersen | female |
| ... | ... | ... | ... |
| 339 | Chinstrap | Dream | male |
| 340 | Chinstrap | Dream | female |
| 341 | Chinstrap | Dream | male |
| 342 | Chinstrap | Dream | male |


In R, we use `rename()` to change column names and `rename_with()` to apply a function, such as `tolower`, to them. The `clean_names()` function in the `janitor` package automatically makes column names unique and consistent.
```
penguins %>%
  rename(island_new=island)

rename_with(penguins, tolower)

clean_names(penguins)
```


In [11]:
# Pandas also has a `rename` method
df.rename(columns = {"species": "Species",
                     "island": "location"}, inplace=True)
df

| | rowid | Species | location | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | 340 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | male | 2009 |
| 340 | 341 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | female | 2009 |
| 341 | 342 | Chinstrap | Dream | 49.6 | 18.2 | 193.0 | 3775.0 | male | 2009 |
| 342 | 343 | Chinstrap | Dream | 50.8 | 19.0 | 210.0 | 4100.0 | male | 2009 |


pandas has no built-in equivalent of `clean_names`, but a short string operation can lowercase the column names and replace spaces with underscores.

In [12]:
df.columns = df.columns.str.lower().str.replace(' ', '_')
df

| | rowid | species | location | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | 340 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | male | 2009 |
| 340 | 341 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | female | 2009 |
| 341 | 342 | Chinstrap | Dream | 49.6 | 18.2 | 193.0 | 3775.0 | male | 2009 |
| 342 | 343 | Chinstrap | Dream | 50.8 | 19.0 | 210.0 | 4100.0 | male | 2009 |


In [13]:
# Or do the same thing with the `rename` method
df = df.rename(columns=lambda x: x.lower().replace(' ', '_'))
df

| | rowid | species | location | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | 340 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | male | 2009 |
| 340 | 341 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | female | 2009 |
| 341 | 342 | Chinstrap | Dream | 49.6 | 18.2 | 193.0 | 3775.0 | male | 2009 |
| 342 | 343 | Chinstrap | Dream | 50.8 | 19.0 | 210.0 | 4100.0 | male | 2009 |
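
The two one-liners above only handle case and spaces. A slightly more thorough sketch, loosely modelled on what `clean_names` is described as doing (unique and consistent names), could look like the following. `clean_column_names` is a hypothetical helper, not a pandas or janitor function, and the messy column names are invented for the demonstration. It does not reproduce every janitor rule, and the numeric suffix could still collide with a column that already has that exact name.

```python
import re

import pandas as pd


def clean_column_names(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with snake_case, de-duplicated column names."""
    seen = {}
    new_names = []
    for name in frame.columns:
        # Lowercase, turn runs of non-alphanumeric characters into "_",
        # then trim underscores left at either end.
        cleaned = re.sub(r"[^0-9a-z]+", "_", str(name).strip().lower()).strip("_")
        # Add a numeric suffix to repeated names so every column stays unique.
        seen[cleaned] = seen.get(cleaned, 0) + 1
        if seen[cleaned] > 1:
            cleaned = f"{cleaned}_{seen[cleaned]}"
        new_names.append(cleaned)
    result = frame.copy()
    result.columns = new_names
    return result


# Invented messy headers, used only to exercise the helper.
messy = pd.DataFrame([[1, 2, 3]], columns=["Bill Length (mm)", "Species", "species"])
cleaned_names = clean_column_names(messy).columns.tolist()
print(cleaned_names)
```

Because the helper returns a copy, the original `messy` DataFrame keeps its headers until the result is assigned back.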
