# This notebook cleans and tidies up the dataset

### Import libraries for cleaning the dataset.

In [61]:
import pandas as pd
import numpy as np

In [62]:
# Read the dataset CSV file with pandas and store it in a reusable variable,
# then take a quick look at the data with the .head() function.
# The path is a raw string (r"...") so the backslash is taken literally.
ABB = pd.read_csv(r"D:\AB_NYC_2019.csv")
ABB.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0
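
A minimal sketch of the loading step, assuming a Windows-style path like the one used above. It shows that a raw string and a doubled-backslash string name the same file, and it reads a small in-memory CSV so it runs without the real file. The `csv_text` sample is a hypothetical stand-in that reuses two rows and a few columns from the output above.

```python
import io
from pathlib import PureWindowsPath

import pandas as pd

# A raw string keeps each backslash literal, so it equals the escaped form.
raw_path = r"D:\AB_NYC_2019.csv"
escaped_path = "D:\\AB_NYC_2019.csv"
print(raw_path == escaped_path)
print(PureWindowsPath(raw_path).name)

# Hypothetical stand-in for the real file: two rows, five of the columns.
csv_text = (
    "id,name,neighbourhood_group,room_type,price\n"
    "2539,Clean & quiet apt home by the park,Brooklyn,Private room,149\n"
    "2595,Skylit Midtown Castle,Manhattan,Entire home/apt,225\n"
)

# read_csv accepts any file-like object, not only a path on disk.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.head())
```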


In [63]:
# Use the pandas .info() function to show the size, column types, and fields of the dataset.
ABB.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

### As you can see above, the Airbnb dataset contains exactly 48895 rows and 16 columns. Of these columns, 3 are float fields, 7 are integer fields, and 6 are object/string fields.

### The fields range from the names and locations of the Airbnb rentals to their room type, pricing, and availability.
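
The row, column, and type counts quoted above can also be read off programmatically instead of from the `.info()` printout. The sketch below assumes a small hypothetical `sample` frame built from three rows of the `head()` output; on the real data the same calls would be made on `ABB`.

```python
import pandas as pd

# Hypothetical three-row sample using values from the head() output.
sample = pd.DataFrame({
    "id": [2539, 2595, 3647],
    "name": [
        "Clean & quiet apt home by the park",
        "Skylit Midtown Castle",
        "THE VILLAGE OF HARLEM....NEW YORK !",
    ],
    "latitude": [40.64749, 40.75362, 40.80902],
    "price": [149, 225, 150],
})

# shape is (rows, columns).
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# Count how many columns share each dtype.
dtype_counts = sample.dtypes.value_counts()
print(dtype_counts)

# Numeric columns (integer and float) can be selected in one call.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```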

In [65]:
# Check which columns contain NaN values and how many each column reports.
ABB.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64
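
A per-column count is easier to read when it lists only the affected columns and adds each one's share of rows. The sketch below shows one way to do that, assuming a hypothetical four-row `sample` frame rather than the real data. In the `head()` output above, the one listing with 0 reviews also has empty `last_review` and `reviews_per_month`, which may explain why those two columns report the same count.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: listings without reviews lack both review fields.
sample = pd.DataFrame({
    "name": ["Skylit Midtown Castle", None, "Cozy Entire Floor of Brownstone", "Studio"],
    "last_review": ["2019-05-21", None, "2019-07-05", None],
    "reviews_per_month": [0.38, np.nan, 4.64, np.nan],
    "price": [225, 150, 89, 80],
})

# NaN count per column, plus the fraction of rows each count represents.
missing = sample.isnull().sum()
summary = pd.DataFrame({"missing": missing, "share": missing / len(sample)})

# Keep only the columns that actually have missing values, largest first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```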

In [73]:
# Print the total number of NaN values found in the dataset (computed rather than hard-coded).
print('For the entire dataset we have found that there were',ABB.isnull().sum().sum(),'nan values in the entire set.')
# Use the dropna() function to remove every row that contains at least one NaN value.
ABB.dropna(inplace=True)
print('Here is the total number of nan values after running the dropna() function',ABB.isnull().sum().sum(),'nan values were found .')

For the entire dataset we have found that there were 20141 nan values in the entire set.
Here is the total number of nan values after running the dropna() function 0 nan values were found .


### Searching the dataset for NaN values found 20141 (~20K) missing values. Dropping every row that contains at least one of them removed 10074 rows, leaving 38821 of the original 48895. Later on, in the modeling notebook, I will use the uncleaned dataset versus the cleaned dataset to compare the accuracy of the individual regression scores.
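
The number of NaN values and the number of rows that `dropna()` removes are different quantities: the notebook reports 20141 NaN values, while the row count in `.info()` falls from 48895 to 38821. A minimal sketch of the distinction, assuming a hypothetical four-row `sample` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one row is missing three fields, another is missing two.
sample = pd.DataFrame({
    "name": ["A", None, "C", "D"],
    "last_review": ["2019-05-21", None, "2019-07-05", None],
    "reviews_per_month": [0.38, np.nan, 4.64, np.nan],
    "price": [225, 150, 89, 80],
})

# Total number of missing cells across the whole frame.
nan_values = int(sample.isnull().sum().sum())

# Number of rows that contain at least one missing cell.
rows_with_nan = int(sample.isnull().any(axis=1).sum())

# dropna() without inplace=True returns a new frame and leaves sample intact.
dropped = sample.dropna()

print(nan_values, rows_with_nan, len(sample) - len(dropped))
```

Because a single row can hold several NaN values, the cell count is an upper bound on the rows lost, not the number of rows lost.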

In [76]:
ABB.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38821 entries, 0 to 48852
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              38821 non-null  int64  
 1   name                            38821 non-null  object 
 2   host_id                         38821 non-null  int64  
 3   host_name                       38821 non-null  object 
 4   neighbourhood_group             38821 non-null  object 
 5   neighbourhood                   38821 non-null  object 
 6   latitude                        38821 non-null  float64
 7   longitude                       38821 non-null  float64
 8   room_type                       38821 non-null  object 
 9   price                           38821 non-null  int64  
 10  minimum_nights                  38821 non-null  int64  
 11  number_of_reviews               38821 non-null  int64  
 12  last_review                     

In [84]:
# Keep only the fields that I will use for the regression model and analysis.
Cleaned_ABB = ABB[['neighbourhood_group','room_type','price','minimum_nights','number_of_reviews','availability_365']]

In [85]:
Cleaned_ABB.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38821 entries, 0 to 48852
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   neighbourhood_group  38821 non-null  object
 1   room_type            38821 non-null  object
 2   price                38821 non-null  int64 
 3   minimum_nights       38821 non-null  int64 
 4   number_of_reviews    38821 non-null  int64 
 5   availability_365     38821 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 1.8+ MB


In [86]:
Cleaned_ABB.head()

Unnamed: 0,neighbourhood_group,room_type,price,minimum_nights,number_of_reviews,availability_365
0,Brooklyn,Private room,149,1,9,365
1,Manhattan,Entire home/apt,225,1,45,355
3,Brooklyn,Entire home/apt,89,1,270,194
4,Manhattan,Entire home/apt,80,10,9,0
5,Manhattan,Entire home/apt,200,3,74,129


### Removed all unnecessary fields, as I am going to create the model based on the New York districts and the room types that are available to rent.

In [87]:
# Export the newly cleaned dataset; this file will be used for reporting and model creation.
# Cleaned_ABB (the six selected fields) is written here, not the full ABB frame.
Cleaned_ABB.to_csv(r'D:\Cleaned_data.csv')
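
One thing to watch when the exported file is read back: by default `to_csv` also writes the row index, which reappears as an `Unnamed: 0` column on reload. The sketch below shows the effect and the `index=False` alternative, assuming a hypothetical two-row `cleaned` frame and in-memory buffers in place of files on `D:`.

```python
import io

import pandas as pd

# Hypothetical cleaned frame; the index has gaps, as it would after dropna().
cleaned = pd.DataFrame(
    {"neighbourhood_group": ["Brooklyn", "Manhattan"], "price": [149, 225]},
    index=[0, 3],
)

# Default export: the index is written as an unnamed first column.
with_index = io.StringIO()
cleaned.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
cleaned.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

Whether to keep the index depends on whether the original row numbers are needed later; for a modeling input they usually are not.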