# <font color='#eb3483'> Exploratory Data Analysis - Part 2 </font>


Let's remind ourselves of where we left off.

### <font color='#eb3483'> Data processing steps </font>
- There are xxx duplicate rows (we have removed them)
- The variables `xxx, xxx, xxx and xxx` have missing values - what did we do with these?
- The categorical variable `xxx, xxx` has a dominant class (65% of xxx are xxx, etc)
- There are outliers in the variables `xxx and xxx` - what did we do with these?


### <font color='#eb3483'> Entity Description </font>

Here we describe the possible entities (groupings) that we can break our dataset into. This will help us think of different ways to slice and group the dataset in later steps.

- 5 neighbourhood groups (`neighbourhood_group`)
- xxx neighbourhoods
- `room_type` -> xxx
- `accommodates` -> a good range of property sizes


In [4]:
# Importing the required packages here

import numpy as np
import pandas as pd
import seaborn as sns

from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

### <font color='#eb3483'>  Load our data </font>
After each step it is important to save the dataset under a different name (so we don't modify the original).

In [5]:
df = pd.read_csv("data/ny_airbnb_processed.csv")
df.head()

Unnamed: 0,id,host_id,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_rating,year
0,2595,2845,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,2,,1.0,150,30,48,2019-11-04,0.34,3,308.0,4.7,2019.0
1,3831,4869,Brooklyn,Bedford-Stuyvesant,40.68494,-73.95765,Entire home/apt,3,1.0,3.0,79,1,403,2021-05-04,5.16,1,208.0,4.46,2021.0
2,5121,7356,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,2,1.0,1.0,60,30,50,2016-06-05,0.56,1,365.0,4.52,2016.0
3,5136,7378,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,4,2.0,2.0,175,21,1,2014-01-02,0.01,1,134.0,5.0,2014.0
4,5178,8967,Manhattan,Midtown,40.76457,-73.98317,Private room,2,1.0,1.0,61,2,474,2020-09-25,3.61,1,246.0,4.19,2020.0
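
The `Unnamed: 0` column in the preview above is most likely the old row index, which pandas writes by default when saving and then reads back as an ordinary column. The sketch below reproduces this on a two-row toy frame (an assumption made for illustration, not the real file) and shows that saving with `index=False` avoids it.

```python
import io
import pandas as pd

# Two-row toy frame standing in for the processed listings.
listings = pd.DataFrame({"id": [2595, 3831], "price": [150, 79]})

# Default to_csv() also writes the row index, as an unnamed first column.
buffer_with_index = io.StringIO()
listings.to_csv(buffer_with_index)
buffer_with_index.seek(0)
reloaded_with_index = pd.read_csv(buffer_with_index)

# index=False keeps only the real columns.
buffer_clean = io.StringIO()
listings.to_csv(buffer_clean, index=False)
buffer_clean.seek(0)
reloaded_clean = pd.read_csv(buffer_clean)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

When reading a file that already contains the extra column, passing `index_col=0` to `pd.read_csv` is another way to keep it out of the variables.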


## <font color='#eb3483'> 4. Digging into patterns </font>

The final stage of our EDA journey is to look at the relationships between the variables in our data. The goal of this stage is to get a better understanding of how the variables interact with one another.

## <font color='#eb3483'> Multiple group counts </font>

In [6]:
def pivot_count(df, rows, columns):
    """Count the rows for every combination of the `rows` and `columns` variables."""
    df_pivot = df.pivot_table(values="id",  # could be any column, since we are just counting rows
                              index=rows,
                              columns=columns,
                              aggfunc=np.size
                             ).dropna(axis=0, how='all')  # drop index values that have no rows at all
    return df_pivot

In [7]:
# let's remind ourselves of our variables
df.head()

Unnamed: 0,id,host_id,neighbourhood_group,neighbourhood,latitude,longitude,room_type,accommodates,bedrooms,beds,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,review_rating,year
0,2595,2845,Manhattan,Midtown,40.75356,-73.98559,Entire home/apt,2,,1.0,150,30,48,2019-11-04,0.34,3,308.0,4.7,2019.0
1,3831,4869,Brooklyn,Bedford-Stuyvesant,40.68494,-73.95765,Entire home/apt,3,1.0,3.0,79,1,403,2021-05-04,5.16,1,208.0,4.46,2021.0
2,5121,7356,Brooklyn,Bedford-Stuyvesant,40.68535,-73.95512,Private room,2,1.0,1.0,60,30,50,2016-06-05,0.56,1,365.0,4.52,2016.0
3,5136,7378,Brooklyn,Sunset Park,40.66265,-73.99454,Entire home/apt,4,2.0,2.0,175,21,1,2014-01-02,0.01,1,134.0,5.0,2014.0
4,5178,8967,Manhattan,Midtown,40.76457,-73.98317,Private room,2,1.0,1.0,61,2,474,2020-09-25,3.61,1,246.0,4.19,2020.0


In [8]:
room_accommodates = pivot_count(df, "room_type","accommodates")
room_accommodates

room_type / accommodates,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
Entire home/apt,,497.0,6710.0,2985.0,4754.0,1408.0,1588.0,303.0,464.0,56.0,167.0,30.0,79.0,21.0,16.0,8.0,79.0
Hotel room,23.0,3.0,166.0,15.0,58.0,1.0,11.0,,1.0,,,,,,,,
Private room,,5301.0,9417.0,622.0,606.0,71.0,74.0,14.0,27.0,3.0,10.0,2.0,6.0,1.0,2.0,3.0,14.0
Shared room,,356.0,183.0,58.0,25.0,6.0,3.0,,1.0,,1.0,,1.0,1.0,,,2.0
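
As an alternative to the `pivot_count` helper, pandas ships `pd.crosstab`, which builds the same kind of count table and fills empty combinations with 0 instead of NaN. The following is a minimal sketch on a made-up five-row frame; the values are illustrative only.

```python
import pandas as pd

# Made-up rows, using the same column names as the listings data.
toy = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Shared room"],
    "accommodates": [1, 2, 2, 2, 1],
})

# Raw counts for every room_type / accommodates combination.
counts = pd.crosstab(toy["room_type"], toy["accommodates"])

# Same table as row shares: each row sums to 1.
row_shares = pd.crosstab(toy["room_type"], toy["accommodates"], normalize="index")

print(counts)
print(row_shares)
```

Row shares can be easier to compare than raw counts when one group (here, a room type) has far more listings than the others.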


In [10]:
# which neighbourhood_groups have the greatest number of choices for accommodating very large groups?
pivot_count(df, "neighbourhood_group","accommodates")


neighbourhood_group / accommodates,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
Bronx,,208.0,440.0,94.0,133.0,42.0,37.0,11.0,12.0,2.0,9.0,3.0,6.0,,1.0,2.0,3.0
Brooklyn,3.0,2449.0,6495.0,1247.0,2158.0,609.0,728.0,131.0,236.0,21.0,84.0,9.0,24.0,5.0,5.0,5.0,44.0
Manhattan,20.0,2346.0,7270.0,1929.0,2544.0,665.0,696.0,117.0,160.0,21.0,52.0,12.0,34.0,5.0,9.0,2.0,31.0
Queens,,1103.0,2155.0,388.0,560.0,151.0,194.0,56.0,74.0,15.0,30.0,6.0,19.0,13.0,3.0,2.0,16.0
Staten Island,,51.0,116.0,22.0,48.0,19.0,21.0,2.0,11.0,,3.0,2.0,3.0,,,,1.0


In [None]:
# play around with some options. How about neighbourhood_group and price?
pivot_count(df, "neighbourhood_group", "price")

# any thoughts?

In [None]:
# observations - not the best method to use with numerical variables (far too many distinct values!)
# errr, how can we have properties that cost $0? I would definitely stay there.

# let's check this ...
# df[df.price < 10]
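
One way to make a count table usable for a numerical variable like `price` is to bin it first. The sketch below uses `pd.cut` with hypothetical price bands on made-up rows, then counts listings per band with pandas' built-in `pd.crosstab`. The band edges are assumptions and should be chosen from the real price distribution.

```python
import pandas as pd

# Made-up rows, using the same column names as the listings data.
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Queens", "Queens", "Queens"],
    "price": [40, 120, 60, 250, 90],
})

# Hypothetical price bands; include_lowest=True keeps a price of exactly 0 in the first band.
bins = [0, 50, 100, 200, float("inf")]
labels = ["<=50", "51-100", "101-200", ">200"]
toy["price_band"] = pd.cut(toy["price"], bins=bins, labels=labels, include_lowest=True)

# Count listings per neighbourhood group and price band.
band_counts = pd.crosstab(toy["neighbourhood_group"], toy["price_band"])
print(band_counts)
```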


In [None]:
# Remember - cleaning data is not linear. Let's fix that by dropping the listings priced under $10
df = df[df.price >= 10]


<div>
<img src="attachment:image.png" width="700"/>
</div>

## <font color='#eb3483'> Categorical Variable Means </font>

In [None]:
# check out the average price per room type
df.groupby("room_type")["price"].mean().plot.barh()   


In [None]:
#plot mean price of properties per neighbourhood_group 
df.groupby("neighbourhood_group")["price"].mean().plot.barh()   

In [None]:
#plot mean price of properties per neighbourhood 
plt.figure(figsize=(13,30))
df.groupby("neighbourhood")["price"].mean().plot.barh()   
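
A mean on its own can mislead when a group has few listings or a handful of very expensive ones. A possible refinement, sketched here on made-up numbers, is to compute the mean, median, and listing count together so that skewed or tiny groups stand out.

```python
import pandas as pd

# Made-up rows; the 1000 is a deliberate outlier.
toy = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt"],
    "price": [50, 60, 1000, 150, 170],
})

# Mean, median and number of listings per group, sorted by median price.
summary = (
    toy.groupby("room_type")["price"]
    .agg(["mean", "median", "count"])
    .sort_values("median")
)
print(summary)
```

In this toy example the private rooms have the higher mean but the lower median, which is exactly the kind of gap worth checking before reading too much into a bar chart of means. Calling `summary["median"].plot.barh()` would draw the same style of chart as the cells above.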

In [None]:
# try some different options that might be interesting for you. 


<div>
<img src="attachment:image.png" width="700"/>
</div>

##  <font color='#eb3483'> Correlations </font>


An easy way to see all pairwise relationships at once is the `pairplot` function we learned with seaborn. However, this only works well when there aren't many columns.

**Note:** Scatterplot matrices are computationally heavy, so this might take a while on your computer.

In [None]:
#sns.pairplot(df) #- not going to run this now but try it out for yourself later
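
To keep a pairplot manageable, one option is to pass it only a few columns of interest and a random sample of rows. This is a sketch on randomly generated stand-in data; the column selection and the sample size of 500 are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in for the listings data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "price": rng.integers(30, 500, size=5000),
    "accommodates": rng.integers(1, 8, size=5000),
    "review_rating": rng.uniform(3, 5, size=5000),
    "host_id": rng.integers(1, 10**6, size=5000),  # identifier, not worth plotting
})

# Keep only the columns we care about and a reproducible random sample of rows.
columns_of_interest = ["price", "accommodates", "review_rating"]
sample = toy[columns_of_interest].sample(n=500, random_state=42)

# With seaborn imported as sns, the plot itself would then be:
# sns.pairplot(sample)
print(sample.shape)
```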

In [None]:
df.head()

In [None]:
# Do more expensive rentals get better ratings?
# Is there a relationship between price and review_rating?

sns.regplot(x="price", y="review_rating", data=df);
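
To put a number on a relationship like this, a correlation matrix of the numeric columns is a common complement to the plot. The sketch below uses made-up values and computes both Pearson (linear) and Spearman (rank-based) correlations with `DataFrame.corr`.

```python
import pandas as pd

# Made-up rows, using the same column names as the listings data.
toy = pd.DataFrame({
    "price": [60, 80, 100, 150, 300],
    "review_rating": [4.2, 4.5, 4.4, 4.8, 4.6],
    "room_type": ["Private room", "Private room", "Shared room",
                  "Entire home/apt", "Entire home/apt"],
})

# Correlations only make sense for numeric columns, so select those first.
numeric = toy.select_dtypes("number")

pearson = numeric.corr()                    # linear relationship
spearman = numeric.corr(method="spearman")  # rank-based, less sensitive to outliers

print(pearson)
print(spearman)
```

Because prices tend to be heavily skewed, the Spearman value is often the more informative of the two for a question like price versus rating.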

In [None]:
#try out any others that interest you

<hr>

Once we have finished the analysis, the last step is to compile all of our findings into one document. This document serves two purposes:

- It informs people in the future of our findings. That person might even be ourselves!
- It makes it easier for other data analysts to replicate the analysis.

Let's summarize all the work we've done!

# <font color='#eb3483'> Analysis Conclusion </font>

### <font color='#eb3483'> Description </font>

Describe the data set. What it is. Where it comes from. Maybe the time period it covers. etc.



### <font color='#eb3483'> Data dictionary </font>

The variables in the dataset are:

```
* xxx (categorical)
etc.
```

### <font color='#eb3483'> Data Processing </font>

- copy from previously

### <font color='#eb3483'> Variable Exploration Description </font>
- copy from previously


### <font color='#eb3483'> Comparisons </font>

- list any insights that stood out for you here.


### <font color='#eb3483'> Conclusions </font>
- list any major conclusions


<hr>

![image.png](attachment:image.png)

## <font color='#eb3483'> 5. Pandas Profiling </font>


To save us some trouble, there is a very nifty package built on top of pandas that automates much of this EDA: pandas-profiling.

You can check it out here: https://github.com/JosPolfliet/pandas-profiling

Here is a great example of pandas-profiling in use: https://medium.com/analytics-vidhya/pandas-profiling-5ecd0b977ecd

(Warning: it can be buggy and hard to get working, but once it runs it works well. The package has since been renamed `ydata-profiling`, so newer environments may need that name instead.)

To install it, run `conda install -c conda-forge pandas-profiling` in a terminal.

With pandas-profiling we can generate a report from a pandas DataFrame that provides a ton of information about the data.

In [None]:
#pip install pandas_profiling

In [None]:
import pandas_profiling

In [None]:
df = pd.read_csv("data/ny_airbnb_processed.csv")

In [None]:
#not going to run as it takes quite a bit of time.
report = pandas_profiling.ProfileReport(df)

In [None]:
report

We can even save the report as an HTML file, which is very useful when sharing with a colleague.

In [None]:
type(report)


In [None]:
report.to_file(output_file='profiling_report.html')