## Introduction to Data Wrangling and Tidying

### Data Wrangling

Let’s look at a subset of restaurant inspections from the New York City Department of Health and Mental Hygiene (NYC DOHMH) to work through some data wrangling processes. The data includes seven different columns with information about a restaurant’s location and health inspection. Here is a description of the dataset’s variables.
```
Pos.    Var. Name                 Var. Description
-----   --------------------      --------------------------------------
0       DBA                       Restaurant name
1       BORO                      Borough
2       CUISINE DESCRIPTION       Type of cuisine
3       GRADE                     Letter grade
4       LATITUDE                  Latitude coordinates of restaurant 
5       LONGITUDE                 Longitude coordinates of restaurant 
6       URL                       URL link to restaurant's website
```

Let’s use the read_csv() function in pandas to load our dataset as a pandas DataFrame and take a look at the first 10 rows.

In [3]:
import pandas as pd
import numpy as np

restaurants = pd.read_csv('/home/oldoc/largeDataSets/DOHMH_New_York_City_Restaurant_Inspection_Results.csv')

In [4]:
# the .head(10) function will show us the first 10 rows in our dataset
print(restaurants.head(10))

      CAMIS                           DBA           BORO BUILDING  \
0  40511702             NOTARO RESTAURANT      MANHATTAN      635   
1  40511702             NOTARO RESTAURANT      MANHATTAN      635   
2  50046354                      VITE BAR         QUEENS     2507   
3  50061389       TACK'S CHINESE TAKE OUT  STATEN ISLAND      11C   
4  41516263                    NO QUARTER       BROOKLYN     8015   
5  50015855               KABAB HOUSE NYC         QUEENS     4339   
6  50058069              HENRI'S BACKYARD       BROOKLYN      256   
7  40807238  RICHMOND COUNTY COUNTRY CLUB  STATEN ISLAND     1122   
8  41547684                  PLANET WINGS  STATEN ISLAND      480   
9  40376944                   TOMOE SUSHI      MANHATTAN      172   

            STREET  ZIPCODE       PHONE CUISINE DESCRIPTION INSPECTION DATE  \
0    SECOND AVENUE  10016.0  2126863400             Italian      06/15/2015   
1    SECOND AVENUE  10016.0  2126863400             Italian      11/25/2014   
2  

In [5]:
restaurants.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399918 entries, 0 to 399917
Data columns (total 18 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   CAMIS                  399918 non-null  int64  
 1   DBA                    399559 non-null  object 
 2   BORO                   399918 non-null  object 
 3   BUILDING               399809 non-null  object 
 4   STREET                 399909 non-null  object 
 5   ZIPCODE                399909 non-null  float64
 6   PHONE                  399913 non-null  object 
 7   CUISINE DESCRIPTION    399918 non-null  object 
 8   INSPECTION DATE        399918 non-null  object 
 9   ACTION                 398783 non-null  object 
 10  VIOLATION CODE         393414 non-null  object 
 11  VIOLATION DESCRIPTION  392939 non-null  object 
 12  CRITICAL FLAG          399918 non-null  object 
 13  SCORE                  376704 non-null  float64
 14  GRADE                  195413 non-nu

In [6]:
# the .shape attribute returns the number of rows and columns in our dataset as (rows, columns)
restaurants.shape

(399918, 18)

When we look closely at the table, we see some missing data. In both the GRADE and URL columns, missing values are marked as NaN. In the LATITUDE and LONGITUDE columns, the missing coordinates for IHOP are recorded as (0.000, 0.000). We can treat (0.000, 0.000) as a marker for missing coordinates because that point lies on the equator, far from any restaurant in New York City. Other common placeholders for missing values include NA and -.
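A minimal sketch of how such placeholders could be converted to NaN so that pandas counts them as missing. The `toy` DataFrame and its values are made up for illustration and are not taken from the inspection data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real inspection records)
toy = pd.DataFrame({
    "NAME": ["CAFE A", "DINER B", "GRILL C"],
    "GRADE": ["A", "-", "NA"],
    "LATITUDE": [40.73, 0.0, 40.68],
    "LONGITUDE": [-73.99, 0.0, -73.97],
})

# Treat the placeholder strings "-" and "NA" as missing values
toy["GRADE"] = toy["GRADE"].replace(["-", "NA"], np.nan)

# Treat a (0.0, 0.0) coordinate pair as missing
at_origin = (toy["LATITUDE"] == 0) & (toy["LONGITUDE"] == 0)
toy.loc[at_origin, ["LATITUDE", "LONGITUDE"]] = np.nan

# Count the missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)
```

Converting placeholders early means that later calls such as isna() treat every kind of missing value the same way.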

### Preliminary data cleaning

There are also duplicate rows for the restaurant labeled Seamore’s. To remove any duplicate rows, we can use the drop_duplicates() function.

In [7]:
# the .drop_duplicates() function removes duplicate rows
restaurants = restaurants.drop_duplicates()

In [8]:
# the .head(10) function will show us the first 10 rows in our dataset
print(restaurants.head(10))

      CAMIS                           DBA           BORO BUILDING  \
0  40511702             NOTARO RESTAURANT      MANHATTAN      635   
1  40511702             NOTARO RESTAURANT      MANHATTAN      635   
2  50046354                      VITE BAR         QUEENS     2507   
3  50061389       TACK'S CHINESE TAKE OUT  STATEN ISLAND      11C   
4  41516263                    NO QUARTER       BROOKLYN     8015   
5  50015855               KABAB HOUSE NYC         QUEENS     4339   
6  50058069              HENRI'S BACKYARD       BROOKLYN      256   
7  40807238  RICHMOND COUNTY COUNTRY CLUB  STATEN ISLAND     1122   
8  41547684                  PLANET WINGS  STATEN ISLAND      480   
9  40376944                   TOMOE SUSHI      MANHATTAN      172   

            STREET  ZIPCODE       PHONE CUISINE DESCRIPTION INSPECTION DATE  \
0    SECOND AVENUE  10016.0  2126863400             Italian      06/15/2015   
1    SECOND AVENUE  10016.0  2126863400             Italian      11/25/2014   
2  

In [9]:
# the .shape attribute returns the number of rows and columns in our dataset as (rows, columns)
restaurants.shape

(399907, 18)
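By default, drop_duplicates() only removes rows that match in every column. This is why two rows for the same restaurant can remain when their inspection dates differ. The sketch below illustrates this on a small made-up DataFrame (the `toy` data is hypothetical).

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are exact duplicates,
# row 2 is the same restaurant with a different inspection date
toy = pd.DataFrame({
    "NAME": ["CAFE A", "CAFE A", "CAFE A", "DINER B"],
    "INSPECTION DATE": ["06/15/2015", "06/15/2015", "11/25/2014", "03/02/2016"],
})

# Default behavior: drop rows identical across all columns, keeping the first
exact = toy.drop_duplicates()

# subset= compares only the listed columns
by_name = toy.drop_duplicates(subset=["NAME"], keep="first")

# duplicated() flags the rows that drop_duplicates() would remove
n_dupes = int(toy.duplicated().sum())

print(exact.shape, by_name.shape, n_dupes)
```

Whether to deduplicate on all columns or on a subset depends on what counts as "the same record" for the analysis at hand.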

You may have noticed that the column holding the restaurant names is called DBA. We can use the rename() function and a dictionary to relabel our columns. While we are renaming columns, we might also want to shorten CUISINE DESCRIPTION to just CUISINE.

In [10]:
# axis=1 refers to the columns; axis=0 would refer to the rows
# in the dictionary, each key is an original column name and each value is the new name: {'oldname1': 'newname1', 'oldname2': 'newname2'}
restaurants = restaurants.rename({'DBA':'NAME', 'CUISINE DESCRIPTION':'CUISINE'}, axis=1)


In [11]:
# the .head(10) function will show us the first 10 rows in our dataset
print(restaurants.head(10))

      CAMIS                          NAME           BORO BUILDING  \
0  40511702             NOTARO RESTAURANT      MANHATTAN      635   
1  40511702             NOTARO RESTAURANT      MANHATTAN      635   
2  50046354                      VITE BAR         QUEENS     2507   
3  50061389       TACK'S CHINESE TAKE OUT  STATEN ISLAND      11C   
4  41516263                    NO QUARTER       BROOKLYN     8015   
5  50015855               KABAB HOUSE NYC         QUEENS     4339   
6  50058069              HENRI'S BACKYARD       BROOKLYN      256   
7  40807238  RICHMOND COUNTY COUNTRY CLUB  STATEN ISLAND     1122   
8  41547684                  PLANET WINGS  STATEN ISLAND      480   
9  40376944                   TOMOE SUSHI      MANHATTAN      172   

            STREET  ZIPCODE       PHONE    CUISINE INSPECTION DATE  \
0    SECOND AVENUE  10016.0  2126863400    Italian      06/15/2015   
1    SECOND AVENUE  10016.0  2126863400    Italian      11/25/2014   
2         BROADWAY  11106.0  3
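As a small illustration of the renaming step, the sketch below uses the columns= keyword, which is equivalent to passing the dictionary with axis=1. The `toy` DataFrame is made up for the example.

```python
import pandas as pd

# Hypothetical one-row DataFrame with the original column names
toy = pd.DataFrame({
    "DBA": ["CAFE A"],
    "CUISINE DESCRIPTION": ["Italian"],
    "BORO": ["QUEENS"],
})

# columns= is equivalent to passing the mapping with axis=1
renamed = toy.rename(columns={"DBA": "NAME", "CUISINE DESCRIPTION": "CUISINE"})

# Keys that match no column are silently ignored by default
unchanged = toy.rename(columns={"NOT A COLUMN": "X"})

print(list(renamed.columns))
```

rename() returns a new DataFrame and leaves `toy` untouched, which is why the result is assigned back to a variable.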

### Data Types

Let’s take a look at each column’s data types by appending .dtypes to our pandas dataframe. 

In [12]:
restaurants.dtypes

CAMIS                      int64
NAME                      object
BORO                      object
BUILDING                  object
STREET                    object
ZIPCODE                  float64
PHONE                     object
CUISINE                   object
INSPECTION DATE           object
ACTION                    object
VIOLATION CODE            object
VIOLATION DESCRIPTION     object
CRITICAL FLAG             object
SCORE                    float64
GRADE                     object
GRADE DATE                object
RECORD DATE               object
INSPECTION TYPE           object
dtype: object

We have three data types: int64, float64, and object. int64 holds integers, float64 holds floating-point numbers (i.e., numbers with decimals), and object columns can hold strings or mixed types (both numeric and non-numeric). Other data types include bool (True/False values) and datetime64 (date and/or time values).
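Columns read in as object or float64 can often be converted to a more specific type. The following is a minimal sketch on made-up data, assuming the dates follow the MM/DD/YYYY pattern seen in the output above. The target types chosen here are one reasonable option, not the only one.

```python
import pandas as pd

# Hypothetical toy data mirroring three of the column types above
toy = pd.DataFrame({
    "INSPECTION DATE": ["06/15/2015", "11/25/2014"],
    "ZIPCODE": [10016.0, 11106.0],
    "BORO": ["MANHATTAN", "QUEENS"],
})

# object -> datetime64, using an explicit date format
toy["INSPECTION DATE"] = pd.to_datetime(toy["INSPECTION DATE"], format="%m/%d/%Y")

# float64 -> nullable integer ("Int64" can still hold missing values)
toy["ZIPCODE"] = toy["ZIPCODE"].astype("Int64")

# object -> category, useful for columns with few distinct values
toy["BORO"] = toy["BORO"].astype("category")

print(toy.dtypes)
```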

Since we have both continuous (float64) and categorical (object) variables in our data, it might be informative to look at the number of unique values in each column using the nunique() function.

In [13]:
# .nunique() counts the number of unique values in each column 
restaurants.nunique()

CAMIS                    26505
NAME                     20936
BORO                         6
BUILDING                  7256
STREET                    3328
ZIPCODE                    233
PHONE                    25165
CUISINE                     84
INSPECTION DATE           1414
ACTION                       5
VIOLATION CODE              97
VIOLATION DESCRIPTION       93
CRITICAL FLAG                3
SCORE                      120
GRADE                        6
GRADE DATE                1325
RECORD DATE                  1
INSPECTION TYPE             34
dtype: int64

### Missing Data

From our initial inspection of the data, we know we have missing data in grade, url, latitude, and longitude. Let’s take a look at how the data is missing, also referred to as missingness. To do this we can use isna() to identify if the value is missing. This will give us a boolean and indicate if the observation in that column is missing (True) or not (False). We will also use sum() to count the number of missing values, where isna() returns True. 

In [14]:
# counts the number of missing values in each column 
restaurants.isna().sum() 

CAMIS                         0
NAME                        359
BORO                          0
BUILDING                    109
STREET                        9
ZIPCODE                       9
PHONE                         5
CUISINE                       0
INSPECTION DATE               0
ACTION                     1135
VIOLATION CODE             6504
VIOLATION DESCRIPTION      6974
CRITICAL FLAG                 0
SCORE                     23207
GRADE                    204494
GRADE DATE               207087
RECORD DATE                   0
INSPECTION TYPE            1135
dtype: int64
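The same idea on a small made-up DataFrame, shown step by step. Because True counts as 1 and False as 0, sum() gives the number of missing values, and mean() gives the share of missing values in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "GRADE": ["A", np.nan, "B", np.nan],
    "SCORE": [10.0, 27.0, np.nan, 13.0],
})

# Boolean DataFrame: True where a value is missing
is_missing = toy.isna()

# Number of missing values per column
missing_count = is_missing.sum()

# Proportion of missing values per column
missing_share = is_missing.mean()

print(missing_count)
print(missing_share)
```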

In [15]:
restaurants_grade = restaurants[['GRADE']]
restaurants_grade.value_counts()

GRADE         
A                 154194
B                  28166
C                   6992
Not Yet Graded      2598
Z                   2104
P                   1359
dtype: int64

In [16]:
restaurants_score = restaurants[['SCORE']]
restaurants_score.describe()

               SCORE
count  376700.000000
mean       18.910138
std        12.959017
min        -2.000000
25%        11.000000
50%        15.000000
75%        24.000000
max       151.000000


In [17]:
# .where() keeps values where the condition is True and replaces the rest with NaN,
# so this turns negative SCORE values into missing values
restaurants['SCORE'] = restaurants['SCORE'].where(restaurants['SCORE'] >= 0, np.nan) 

In [18]:
# counts the number of missing values in each column 
restaurants.isna().sum() 

CAMIS                         0
NAME                        359
BORO                          0
BUILDING                    109
STREET                        9
ZIPCODE                       9
PHONE                         5
CUISINE                       0
INSPECTION DATE               0
ACTION                     1135
VIOLATION CODE             6504
VIOLATION DESCRIPTION      6974
CRITICAL FLAG                 0
SCORE                     23323
GRADE                    204494
GRADE DATE               207087
RECORD DATE                   0
INSPECTION TYPE            1135
dtype: int64
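The direction of the condition in where() is easy to get backwards: where() keeps the values for which the condition is True, while mask() replaces them. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical scores, including one negative value and one missing value
scores = pd.Series([12.0, -2.0, np.nan, 30.0])

# where(): keep values where the condition is True, replace the rest with NaN
kept_non_negative = scores.where(scores >= 0)

# mask(): the mirror image, replace values where the condition is True
masked_negative = scores.mask(scores < 0)

# The negative score and the original NaN are both missing now
n_missing = int(kept_non_negative.isna().sum())

print(kept_non_negative)
```

Both calls give the same result here. Which one reads better depends on whether it is more natural to state the values to keep or the values to discard.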

### Characterizing missingness with crosstab

Let's try to understand the missingness in the PHONE column by counting the missing values in each borough. We will use the crosstab() function in pandas to do this.

The crosstab() function computes a frequency table of two or more variables. To look at the missingness in the PHONE column, we can call isna() on the column to identify whether each value is NaN. This returns a boolean: True if the value is NaN and False if it is not. In our crosstab, we will look at all the boroughs present in our data and whether or not they have missing phone numbers.

In [20]:
pd.crosstab(
    # tabulates the boroughs as the index
    restaurants['BORO'],
    # tabulates whether the PHONE value is missing (True/False) as columns
    restaurants['PHONE'].isna(),
    # names the rows
    rownames=['BORO'],
    # names the columns
    colnames=['PHONE NA']
)

PHONE NA        False  True
BORO
BRONX           34895     0
BROOKLYN        99592     3
MANHATTAN      159566     0
Missing             9     0
QUEENS          92414     1
STATEN ISLAND   13426     1
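Raw counts can be hard to compare when boroughs differ a lot in size. As a sketch, crosstab() also accepts normalize='index', which turns each row into proportions. The `toy` data below, including the phone numbers, is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two boroughs, some missing phone numbers
toy = pd.DataFrame({
    "BORO": ["QUEENS", "QUEENS", "BRONX", "BRONX", "BRONX"],
    "PHONE": ["7185550101", np.nan, "7185550102", "7185550103", np.nan],
})

# Counts of non-missing (False) and missing (True) phone numbers per borough
counts = pd.crosstab(
    toy["BORO"],
    toy["PHONE"].isna(),
    rownames=["BORO"],
    colnames=["PHONE NA"],
)

# Same table, with each row rescaled so that it sums to 1
shares = pd.crosstab(
    toy["BORO"],
    toy["PHONE"].isna(),
    rownames=["BORO"],
    colnames=["PHONE NA"],
    normalize="index",
)

print(counts)
print(shares)
```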
