# <font color='#eb3483'> Exploratory Data Analysis </font>


## What is Exploratory Data Analysis?  

Exploratory Data Analysis (EDA) is the process of understanding a data set by summarizing its main characteristics, often by plotting them.   

Common plots in EDA include histograms, box plots, scatter plots and many more.   

Exploring the data often takes a lot of time. Working through EDA also helps us define, or refine, the problem statement for our data set, which is very important.

![image.png](attachment:image.png)

## How to perform Exploratory Data Analysis?  

This is a question that everyone is keen on knowing the answer to. Well, the answer is that it depends on the data set you are working on. There is no single method for performing EDA. However, there are a few steps that it generally includes:  

1. Loading and inspecting your data
1. Cleaning the data which includes:  
   2.1. dropping data points and columns we don't need.   
   2.2. checking data types and fixing if needed  
   2.3. removing duplicates  
   2.4. dealing with missing values  
   2.5. looking for outliers and deciding how to deal with these  
   2.6. reformatting columns if needed

1. Some visual exploration to look at relationships between variables or interesting insights that jump out
1. How can you add, change or remove features to get more out of your data? 

# In our notebooks:

In this module we'll be covering classes in:

1. Loading and inspecting your data
1. Cleaning the data which includes:  
   2.1. dropping some columns      
   2.2. removing duplicates     
   2.3. checking data types and fixing if needed       
   2.4. dealing with missing values  
   2.5. looking for outliers and deciding how to deal with these  
1. Cardinality of categorical variables

1. Digging into patterns
1. Pandas profiling



# <font color='#eb3483'> AirBnB Cape Town </font>




### Background


http://insideairbnb.com/get-the-data.html


What can we say about the prices of Airbnbs in Cape Town?   
Are there certain neighbourhoods that are more expensive?   
Do properties with higher ratings charge more?  
Do more rooms mean more money? 

This is a great place to start digging into these questions or generating hypotheses, with data on the price, neighbourhood, layout and ratings for each Airbnb rental.

![image.png](attachment:image.png)

In [None]:
# Importing the required packages here

import numpy as np
import pandas as pd
import seaborn as sns

from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline

![image.png](attachment:image.png)

## <font color='#eb3483'> 1. Loading and inspecting your data </font>


In [None]:
df = pd.read_csv("data/listings_ct.csv")
df.head()

What other commands can we use to have a high level glance at our data frame?

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df.nunique() # this is a useful one: the number of distinct values in each column

In [None]:
#what are some interesting insights we can pull out just from the above code?
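
A few more commands for a first overview, sketched on a tiny made-up DataFrame. The `toy` frame and its values are invented for illustration (they are not the Airbnb data), but the same calls should work on `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real listings file
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Entire home/apt"],
    "price": [900, 450, 1200, 900],
})

n_rows, n_cols = toy.shape      # number of rows and columns
unique_counts = toy.nunique()   # distinct values per column
summary = toy.describe()        # count, mean, std, min, quartiles, max of numeric columns

toy.info()                      # prints dtypes and non-null counts per column
print(unique_counts)
```

Comparing `nunique` with the number of rows is a quick way to spot identifier-like columns (as many distinct values as rows) and near-constant columns (only one or two distinct values).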

<hr>

![image.png](attachment:image.png)

## <font color='#eb3483'> 2. Clean the data </font>

Some (but potentially not all) steps to take:

2.  Cleaning the data which includes:  
   2.1. dropping columns we don't need  
   2.2. removing duplicates  
   2.3. checking data types and fixing if needed  
   2.4. dealing with missing values  
   2.5. looking for outliers and deciding how to deal with these 



<div>
<img src="attachment:image.png" width="700"/>
</div>

## <font color='#eb3483'> 2.1. Drop columns we don't need </font>


In [None]:
# removing unneeded columns - let's have a browse through and see what we can probably remove
df.head()

# use the unique command to check out what's in a column if needed
#df.bed_type.unique()

# list them 
#name, summary, host_url, host_name, host_about, minimum_nights, maximum_nights, minimum_minimum_nights, minimum_maximum_nights


In [None]:
# First step is to clean the data and see which columns are redundant or unnecessary
# drop the columns we listed above that we don't want. 

columns_no_need = ["name", "summary", "host_url", "host_name", "host_about", "minimum_nights", "maximum_nights", "minimum_minimum_nights", "minimum_maximum_nights"]

df = df.drop(columns_no_need, axis=1)
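
One thing to watch: `drop` raises an error if a listed column is not in the DataFrame, which happens easily when a cell is re-run after the columns are already gone. Here is a minimal sketch of a more forgiving version, using an invented `toy` frame (the column values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data, not the real listings
toy = pd.DataFrame({
    "id": [1, 2],
    "name": ["Sea view flat", "Garden cottage"],
    "host_name": ["A", "B"],
    "price": [900, 450],
})

to_drop = ["name", "host_name", "summary"]  # "summary" is deliberately absent from toy

# columns= is equivalent to axis=1; errors="ignore" skips labels that are not present
trimmed = toy.drop(columns=to_drop, errors="ignore")
print(trimmed.columns.tolist())
```

The trade-off is that `errors="ignore"` also hides typos in column names, so it is worth checking `trimmed.columns` afterwards.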


In [None]:
df.head()

In [None]:
# do we want to rename any columns?
# let's rename neighbourhood_cleansed to neighbourhood

df = df.rename(columns={
    "neighbourhood_cleansed":"neighbourhood",
})
df.head()

<div>
<img src="attachment:image.png" width="700"/>
</div>

## <font color='#eb3483'> 2.2. Removing duplicates </font>


In [None]:
#Let's look for duplicate rows

print(df.shape)

# Rows containing duplicate data
duplicate_rows_df = df[df.duplicated()]

print(duplicate_rows_df.shape)


In [None]:
# whoooops! not cool - let's remove them.
df = df.drop_duplicates(keep='first')

print(df.shape)
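
`df.duplicated()` only flags rows that match an earlier row in every column. Sometimes the same listing appears twice with a small difference in one field. A sketch of checking for repeats on a key column instead, assuming `id` is meant to identify a listing uniquely (the `toy` data below is made up):

```python
import pandas as pd

# Hypothetical toy data: the two rows with id == 2 differ in number_of_reviews
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [900, 450, 450, 1200],
    "number_of_reviews": [10, 3, 4, 0],
})

exact_dupes = toy.duplicated().sum()             # rows identical in every column
key_dupes = toy.duplicated(subset=["id"]).sum()  # rows repeating an earlier id

# keep the first row seen for each id
deduped = toy.drop_duplicates(subset=["id"], keep="first")
print(exact_dupes, key_dupes, deduped.shape)
```

Which copy to keep (`keep="first"` or `keep="last"`) is a judgment call, so it is worth looking at the repeated rows before dropping them.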

<div>
<img src="attachment:image.png" width="700"/>
</div>

## <font color='#eb3483'> 2.3. Check data types </font>


In [None]:
df.head()

In [None]:
# let's have a quick look at our data types
df.dtypes

In [None]:
#Any issues here?

In [None]:
# How about changing host_since to a datetime column? Hint: you can use the function 'to_datetime'
df["host_since"] = pd.to_datetime(df["host_since"])


In [None]:
print(df.dtypes)  # a cell only displays its last expression, so print this one
df.head()

In [None]:
# let's change host ID to an object. It isn't really a number - it's a category
# can't remember how? Google is your friend.
df["host_id"] = df["host_id"].astype(object)
df.dtypes
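
The conversions above can be sketched on a small invented frame, including what happens when a date string cannot be parsed. The `toy` data is hypothetical; `errors="coerce"` and the `category` dtype are standard pandas options, used here as one possible approach rather than the only one.

```python
import pandas as pd

# Hypothetical toy data with one unparseable date
toy = pd.DataFrame({
    "host_id": [101, 102, 101],
    "host_since": ["2015-03-01", "not a date", "2018-11-20"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising an error
toy["host_since"] = pd.to_datetime(toy["host_since"], errors="coerce")

# identifiers are labels, not quantities, so store them as objects
toy["host_id"] = toy["host_id"].astype(object)

# the category dtype stores each distinct label only once
toy["room_type"] = toy["room_type"].astype("category")

print(toy.dtypes)
```

Coerced values show up as missing, so it is worth counting them with `toy["host_since"].isna().sum()` before moving on.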

<div>
<img src="attachment:image.png" width="700"/>
</div>

## <font color='#eb3483'> 2.4. Missing values </font>


In [None]:
print(df.isnull().sum())
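
Raw counts are easier to judge next to the share of rows they represent. A small sketch on made-up data (the `toy` frame and the `missing` table are illustrative names, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some gaps
toy = pd.DataFrame({
    "price": [900, 450, 1200, 700],
    "cleaning_fee": [200.0, np.nan, np.nan, 150.0],
    "review_scores_rating": [95.0, np.nan, 88.0, 90.0],
})

# isnull() gives True/False per cell; sum counts the Trues, mean gives their share
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "share_missing": toy.isnull().mean(),
}).sort_values("share_missing", ascending=False)

print(missing)
```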

### <font color='#eb3483'>Missingno </font>
This package lets us view how our missing data is spread out across rows and columns in a super convenient visual format (package found here: https://github.com/ResidentMario/missingno)

You can install this package by using `pip install missingno`.

In [None]:
# you can use terminal commands inside jupyter notebooks using ! notation
#!pip install missingno
# usually either pip or conda install will do the job for installing a package.

In [None]:
import missingno as msno

msno.matrix(df);

In [None]:
# what do we see here?

# drop any columns you need to ... but be careful :)


What CAN one do about missing values ... 
1. Remove those records with missing values  
`df = df.dropna()`   
or  
`df = df.dropna(subset=["column2", "column5", "this_column", "that_column"])`  


2. Replace the null values with a particular value, for example 0 or "missing". It is a simple technique but adds noise (because it assumes the null values are one specific case).  
`df["column1"] = df.column1.fillna("missing")`  
`df["this_column"] = df.this_column.fillna(0)`  


3. Data Imputation: We can replace the missing values with a particular value, but use some criteria to choose that value. Common imputation practices are imputing with the mean, mode or median.

In [None]:
# impute missing values for cleaning fee - check the distribution, then impute with the mean.
cleaning_fee_mean = df["cleaning_fee"].mean()

df["cleaning_fee"] = df["cleaning_fee"].fillna(cleaning_fee_mean)
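
The mean is pulled around by extreme values, so the median is a common alternative, and imputing within groups can give more plausible values than one number for the whole column. A sketch of both options on invented data (the `toy` frame, the `fee_median` and `fee_group_median` columns, and the choice of `room_type` as the grouping are all assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing fee per room type
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Private room"],
    "cleaning_fee": [300.0, 500.0, np.nan, 100.0, np.nan, 200.0],
})

# Option A: one value for the whole column
overall_median = toy["cleaning_fee"].median()
toy["fee_median"] = toy["cleaning_fee"].fillna(overall_median)

# Option B: the median within each room type, aligned back to the original rows
group_median = toy.groupby("room_type")["cleaning_fee"].transform("median")
toy["fee_group_median"] = toy["cleaning_fee"].fillna(group_median)

print(toy)
```

Whichever option you pick, note it down: imputed values look like real data afterwards.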


There are other techniques to deal with missing values:

- Use a predictive model to predict the missing values.

- More sophisticated methods: [MICE](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/) is a method that deals with missing values, and in this [repository](https://github.com/hammerlab/fancyimpute) there are more methods

[Here](https://gallery.cortanaintelligence.com/Experiment/Methods-for-handling-missing-values-1) there are more strategies

<div>
<img src="attachment:image.png" width="700"/>
</div>

## <font color='#eb3483'> 2.5. Looking for and dealing with outliers </font>


In [None]:
df.describe()

In [None]:
#make a histogram for accommodates using sns
sns.histplot(df.accommodates, bins=20)  # histplot replaces the deprecated distplot

In [None]:
#make a boxplot for accommodates using sns
sns.boxplot(x=df.accommodates)

In [None]:
# lets check out the unique values (always useful)
df.accommodates.unique()

In [None]:
# hmmmmm, 40 looks quite high. Let's check this out ... is this a true outlier?
# Pull out values > 19, have a look and see what you think
df[df.accommodates > 19]

In [None]:
#Check out bedrooms in the same way.

In [None]:
#Check out price in the same way.

In [None]:
df["price_usd"] = (df["price"] / 15).round(0)  # approximate conversion to USD at a fixed rate of 15
df.head()

In [None]:
df.price_usd.describe()

In [None]:
sns.histplot(df.price_usd)  # histplot replaces the deprecated distplot

In [None]:
sns.boxplot(x=df.price_usd)

In [None]:
df[df.price_usd>50000]

In [None]:
# Let's deal with some price issues
# check out shape to remind yourself
# Remove rows with a price_usd of 60 000 or more (hint: you can do this by retaining anything under)


df = df[df.price_usd < 60000]

### IMPORTANT: Remember to include the assumptions you made and the steps you took about outliers in your notes at the end of your notebook

In [None]:
# check shape again

In [None]:
# now replot the boxplot of price

In [None]:
# what if we HAD LOTS OF VARIABLES and wanted to look at all the numerical values at once?
numerical = [
  'accommodates', 'bathrooms', 'bedrooms', 'beds', 'price', 'cleaning_fee', 'number_of_reviews', 'review_scores_rating'
]

# could also pull out the column names based on dtype
# numerical = df.select_dtypes(include=np.number).columns.tolist() # different way of doing the above (picks up every numeric column)


In [None]:
df[numerical].hist(bins=15, figsize=(15, 8), layout=(3, 4));
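
Selecting columns by dtype can be sketched like this. Note that `select_dtypes` returns a DataFrame, so we take `.columns.tolist()` to get a list of names that can be used for indexing. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    "neighbourhood": ["Ward 115", "Ward 54"],
    "accommodates": [2, 6],
    "price": [900.0, 2500.0],
})

# names of the numeric columns, and of everything else
numerical = toy.select_dtypes(include=np.number).columns.tolist()
categorical = toy.select_dtypes(exclude=np.number).columns.tolist()

print(numerical)
print(categorical)
```

Keep in mind that this picks up identifier columns stored as numbers too, which is one reason for converting IDs to `object` earlier.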


EXERCISE: Take 10-15 minutes and investigate cleaning fee, number of reviews, bedrooms, bathrooms, beds and review scores rating and see whether you think there might be any outliers in there. 

<hr>

In general, extreme values are values that lie far from the bulk of a variable's distribution, and summary statistics estimated on a column with outliers can be unreliable.

One common practice is to treat as outliers those values with an absolute z-score higher than 3 (that is, values more than 3 standard deviations above or below the mean).

The z-score is defined as:

$$z(x)= \frac{x-\mu}{\sigma}$$

Here $\mu$ is the mean and $\sigma$ the standard deviation of the variable. So you can always check whether any values exceed this threshold to spot potential extreme outliers.
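
A minimal sketch of the z-score check on invented numbers (the series below is made up and is not the real `accommodates` column). It uses the pandas default for `std`, which is the sample standard deviation.

```python
import pandas as pd

# Hypothetical values: mostly small properties plus one very large one
accommodates = pd.Series([2] * 6 + [4] * 8 + [6] * 5 + [40], name="accommodates")

# z-score: distance from the mean in units of standard deviation
z = (accommodates - accommodates.mean()) / accommodates.std()

# keep only the values more than 3 standard deviations from the mean
flagged = accommodates[z.abs() > 3]
print(flagged)
```

The outliers themselves inflate the mean and the standard deviation, so in small or heavily skewed columns this rule can miss values that look extreme on a boxplot. Treat it as a prompt to look closer, not a verdict.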

![image.png](attachment:image.png)

## <font color='#eb3483'> 3. Cardinality: Assessing categorical variables </font>


It is good practice to look at the categorical variables to get an idea of their cardinality, and how useful they might be in groupings or as predictive variables. 


High cardinality = many distinct values, so few repeats (in the extreme, every value is different)    
Low cardinality = few distinct values, so many repeats (for example, almost all rows share one value)



In [None]:
#remind yourself of data types
df.dtypes

In [None]:
df.property_type.unique()

In [None]:
sns.countplot(y = df['property_type']);


In [None]:
#Or we can view this as a ratio
df.property_type.value_counts(normalize=True).plot.barh(); # as a ratio
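
To compare several categorical columns at once, one option is a small table with the number of distinct values and the share of the most common value. This is a sketch on made-up data (the `toy` frame and the `cardinality` table are illustrative, and the values are invented):

```python
import pandas as pd

# Hypothetical toy data: a near-constant column, a balanced one, and an all-different one
toy = pd.DataFrame({
    "city": ["Cape Town"] * 9 + ["Camps Bay"],
    "property_type": ["Apartment"] * 5 + ["House"] * 3 + ["Villa", "Loft"],
    "zipcode": [str(8000 + i) for i in range(10)],
})

cardinality = pd.DataFrame({
    # number of distinct values per column
    "n_unique": toy.nunique(),
    # value_counts sorts in descending order, so the first entry is the dominant class
    "top_share": toy.apply(lambda col: col.value_counts(normalize=True).iloc[0]),
})

print(cardinality)
```

A column with a `top_share` close to 1 has little to offer for grouping, and a column where `n_unique` is close to the number of rows behaves more like an identifier than a category.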

<hr>

So what can we summarize from our above steps?

### <font color='#eb3483'> Data Dictionary </font>

It is important to write down the description and datatypes of the variables.
(In our case it's pretty self-explanatory, but you may have done some major transformations and it's good to try and keep track of them, especially when sharing notebooks (think internships) or coming back to one months later.)

* id            --                     int64  
* host_id         --                categorical  
* host_since       --               date 
* neighbourhood      --             categorical  
* city        --                    categorical  
* zipcode      --                   categorical  
* latitude      --                 float64  
* longitude     --                 float64  
* property_type       --            categorical  
* room_type           --            categorical  
* accommodates        --             int64  
* bathrooms           --           float64  
* bedrooms             --          float64  
* beds                 --          float64  
* bed_type             --           categorical  
* price                --            int64  
* cleaning_fee         --          float64  
* number_of_reviews    --            int64  
* review_scores_rating  --         float64  




### <font color='#eb3483'> Data processing steps </font>
- There are xxx duplicate rows (we have removed them)
- The variables `xxx, xxx, xxx and xxx` have missing values - what did we do with these?
- The categorical variable `xxx, xxx` has a dominant class (65% of xxx are xxx, etc)
- There are outliers in the variables `xxx and xxx` - what did we do with these?


### <font color='#eb3483'> Variable Exploration Description </font>
(Distributions & Cardinality)  
Here we describe the possible entities (groupings) that we can break our dataset into. This will help us think of different ways to slice and group the dataset in further steps.

- Use neighbourhood or zipcode (but what does neighbourhood mean?).   
- Most common zipcode is 8001 and ward is 115.
- City was almost all Cape Town, so not very informative for differentiation (ie low cardinality).
- Property_type - whole houses and apartments are the most common type.
- Room_type - a lot of entire apartments and shared rooms. 
- Bed_type - predominantly real beds. Not much value in this variable.
- Accommodates - good range of sizes of properties.


### <font color='#eb3483'>  Saving our data </font>
After each step it is important to save the dataset with a different name (so we don't modify the original).

In [None]:
df.to_csv("data/airbnb_processed.csv", index=False)

<hr>

### <font color='#eb3483'>An extra tit-bit: Pickling </font>


We usually export our datasets to csv, because it is a format that is easily readable in pretty much any platform.

However, CSV (`Comma Separated Values`) is a simple format, and when we export a dataframe to csv some of the information gets lost in translation, as categories will turn into text. 

One way to avoid this is to save the dataframe in `pickle`, a native Python format. Saving a dataframe as a pickle file has two main advantages. One is that reading the dataframe is usually faster, because Python can load the file from disk as a dataframe directly (pandas doesn't have to read a text file and convert it into a dataframe). The second advantage is that we keep all of the original column dtypes.

Pandas can read and write pickle files very easily (`read_pickle` and `to_pickle`).

In [None]:
df.to_pickle("data/airbnb_processed.pkl")
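
To see the difference in dtypes, here is a small round trip through both formats. It is a sketch on an invented `toy` frame, written to a temporary folder so that it does not touch the `data/` directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data with a datetime and a category column
toy = pd.DataFrame({
    "host_since": pd.to_datetime(["2015-03-01", "2018-11-20"]),
    "room_type": pd.Series(["Entire home/apt", "Private room"], dtype="category"),
    "price": [900, 450],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    pkl_path = os.path.join(tmp, "toy.pkl")

    toy.to_csv(csv_path, index=False)
    toy.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print(from_csv.dtypes)  # host_since and room_type come back as plain text columns
print(from_pkl.dtypes)  # dtypes match the original frame
```

If you need to stay with CSV, `pd.read_csv` accepts `parse_dates=["host_since"]` and `dtype={"room_type": "category"}` to restore these types on load. Also note that pickle files can run arbitrary code when loaded, so only read pickles from sources you trust.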