# <font color='#eb3483'> Exploratory Data Analysis - Part 2 </font>


Now that our data is clean, we can begin to dig in and uncover some patterns.

In this notebook we cover:

- Digging into patterns
    - group counts (pivot table)
    - some averages (groupby and mean)
    - correlations (pairplot)

- Pandas profiling

In [None]:
# Importing the required packages here

import numpy as np
import pandas as pd
import seaborn as sns

from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

### <font color='#eb3483'>  Load our data </font>
After each step, it is important to save the dataset under a different name (so we don't modify the original).

In [None]:
df = pd.read_csv("data/ny_airbnb_processed.csv")
df.head()

## <font color='#eb3483'> 4. Digging into patterns </font>

The final stage of our EDA journey is to look at the relationships between variables in our data. The goal of this stage is to better understand how the variables interact with each other.

## <font color='#eb3483'> Multiple group counts </font>

In [None]:
def pivot_count(df, rows, columns):
    """Count the rows of df for each combination of `rows` and `columns`."""
    df_pivot = df.pivot_table(values="id",  # could be any column, since we are just counting rows
                              index=rows,
                              columns=columns,
                              aggfunc=np.size
                             ).dropna(axis=0, how='all')  # drop row groups with no listings at all
    return df_pivot
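
To see what a pivot count produces, here is a minimal sketch on a tiny made-up table (the values are invented for illustration and are not from the Airbnb data). It counts rows per combination with `pivot_table`, as the helper above does, and compares the result with `pd.crosstab`, which is an alternative way to get the same counts.

```python
import pandas as pd

# Toy listings table; the values are invented for illustration only.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "room_type": ["Entire home", "Private room", "Entire home",
                  "Shared room", "Private room", "Entire home"],
    "accommodates": [2, 2, 4, 1, 2, 4],
})

# Count rows per (room_type, accommodates) pair with pivot_table.
# Combinations that never occur show up as NaN.
counts_pivot = toy.pivot_table(values="id", index="room_type",
                               columns="accommodates", aggfunc="count")

# crosstab gives the same counts, but fills empty combinations with 0.
counts_xtab = pd.crosstab(toy["room_type"], toy["accommodates"])

print(counts_pivot)
print(counts_xtab)
```

The only visible difference is how empty combinations are shown: `NaN` in the pivot table, `0` in the crosstab.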

In [None]:
# let's remind ourselves of our variables
df.head()

In [None]:
pivot_count(df, "room_type","accommodates")


In [None]:
# which neighbourhood_groups have the greatest number of choices for accommodating very large groups?
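
One possible way to approach this question is to filter to large listings first and then count per group. The sketch below uses a made-up table, and the threshold of 10 guests for a "very large group" is an assumption you may want to change.

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Brooklyn",
                            "Queens", "Manhattan", "Brooklyn"],
    "accommodates": [2, 12, 10, 16, 4, 3],
})

LARGE_GROUP = 10  # assumed cut-off for a "very large group"

# Keep only listings that fit a large group, then count them per neighbourhood_group.
large = toy[toy["accommodates"] >= LARGE_GROUP]
large_counts = large["neighbourhood_group"].value_counts()

print(large_counts)
```

On the real data, the same two lines applied to `df` would give the number of large-group listings per `neighbourhood_group`.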


In [None]:
# play around with some options. How about neighbourhood_group and price
pivot_count(df, "neighbourhood_group","price")

#any thoughts?

In [None]:
# Observation: this is not the best method for numerical variables (far too many distinct values!)
# Also, how can we have properties that cost $0? I would definitely stay there.

# Let's check this...
# Pull out all properties that cost less than $5
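
A minimal sketch of this check, using a boolean mask on a made-up table (the prices are invented; on the real data you would apply the same filter to `df`):

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [0, 3, 75, 150],
})

# Boolean mask: True for rows whose price is below $5.
cheap = toy[toy["price"] < 5]

print(cheap)
print(len(cheap), "listings cost less than $5")
```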


In [None]:
# Remember: cleaning data is not linear. Let's fix that by keeping only listings priced above $10.
df = df[df.price > 10]
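
Because a raw numerical column such as `price` produces one pivot column per distinct value, one option is to group the prices into bands first and count per band. The sketch below uses `pd.cut` on made-up data; the band edges and labels are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Manhattan",
                            "Manhattan", "Queens", "Brooklyn"],
    "price": [40, 85, 120, 300, 60, 95],
})

# Assumed price bands: (0, 50], (50, 100], (100, 200], (200, inf)
bins = [0, 50, 100, 200, np.inf]
labels = ["<=50", "51-100", "101-200", ">200"]
toy["price_band"] = pd.cut(toy["price"], bins=bins, labels=labels)

# Count listings per neighbourhood_group and price band.
band_counts = pd.crosstab(toy["neighbourhood_group"], toy["price_band"])

print(band_counts)
```

With only a handful of bands, the count table stays readable no matter how many distinct prices there are.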


## <font color='#eb3483'> Categorical Variable Means </font>

In [None]:
#check out average price per room type
df.groupby("room_type")["price"].mean().plot.barh()   


In [None]:
#plot mean price of properties per neighbourhood_group 
df.groupby("neighbourhood_group")["price"].mean().plot.barh()   
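
A mean can be pulled upwards by a few very expensive listings, so it may help to look at the median and the group size next to it. This is a minimal sketch on made-up data, using `agg` to compute several statistics in one call.

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "room_type": ["Entire home", "Entire home", "Entire home",
                  "Private room", "Private room"],
    "price": [100, 120, 1000, 50, 70],
})

# Mean, median and number of listings per room type in one table.
summary = toy.groupby("room_type")["price"].agg(["mean", "median", "count"])

print(summary)
```

In this toy example the single $1000 listing pushes the "Entire home" mean far above its median of 120, which is the kind of gap worth checking for in the real data.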


In [None]:
# let's use seaborn instead
my_colors = sns.color_palette("Set2")
sns.barplot(data=df, y="neighbourhood_group", x="price", palette=my_colors)

In [None]:
#plot mean price of properties per neighbourhood 
plt.figure(figsize=(13,30))
df.groupby("neighbourhood")["price"].mean().plot.barh()   
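
With many neighbourhoods the bar chart gets very long, so one option is to look only at the most expensive ones. The sketch below uses made-up data (the neighbourhood names "A" to "D" are placeholders) and `nlargest` to keep the top groups by mean price.

```python
import pandas as pd

# Toy data; names and values are invented for illustration only.
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "C", "C", "D"],
    "price": [100, 200, 50, 300, 500, 80],
})

# Mean price per neighbourhood, then keep the two highest.
mean_price = toy.groupby("neighbourhood")["price"].mean()
top2 = mean_price.nlargest(2)

print(top2)
```

Calling `.plot.barh()` on a result like `top2` would then give a much shorter chart than plotting every neighbourhood.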

In [None]:
# try some different options that might be interesting for you. 


##  <font color='#eb3483'> Correlations </font>


An easy way to visually see all correlations is by using the `pairplot` function we learned with seaborn. However, this only works when there aren't many columns. 

**Note:** Computing scatterplot matrices is a bit heavy, so this might take a while on your computer.

In [None]:
# Not going to run this now, but try it out for yourself later (takes about a minute)
%time sns.pairplot(df) 

# Remember: if you want to kill a running cell, hit stop above.
# You could always pull out a smaller subset of the data you are interested in.
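
As a sketch of that last suggestion, the snippet below selects a few columns and a random sample of rows before plotting. The data is randomly generated for illustration, and the choice of columns and the sample size of 200 are assumptions.

```python
import numpy as np
import pandas as pd

# Randomly generated toy data, for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "id": np.arange(1000),
    "price": rng.integers(20, 500, size=1000),
    "accommodates": rng.integers(1, 8, size=1000),
    "review_rating": rng.uniform(60, 100, size=1000),
})

# Keep only the columns of interest and a reproducible random sample of rows.
cols = ["price", "accommodates", "review_rating"]
subset = toy[cols].sample(n=200, random_state=42)

print(subset.shape)
```

The resulting `subset` could then be passed to `sns.pairplot(subset)`, which has far fewer panels and points to draw.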

In [None]:
# What is %time? It is one of IPython's "magics":
# commands that give us information about (or control over) what we are running.
# %time is a line magic that tells us how long a single statement takes to run
# (the cell-magic form, %%time, times a whole cell).
# You can see all magics with:
%lsmagic
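
Magics only work inside IPython or Jupyter. As a rough sketch of what a timing magic does, here is a plain-Python way to time a statement with the standard library (the summed expression is just a stand-in for any slow computation):

```python
import time

# Record a high-resolution timestamp before and after the work.
start = time.perf_counter()
total = sum(i * i for i in range(100_000))  # stand-in for a slow computation
elapsed = time.perf_counter() - start

print(f"Took {elapsed:.4f} seconds")
```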


In [None]:
df.head()

In [None]:
#QUESTION -  Do more expensive rentals get better ratings?
#aka Is there a relationship between price and review_rating?

sns.regplot(x="price", y="review_rating", data=df);  # recent seaborn versions require x and y as keywords
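
A plot shows the shape of a relationship; a correlation coefficient puts a number on it. This is a minimal sketch on made-up data using `DataFrame.corr`, with Pearson (linear) and Spearman (rank-based) correlation side by side.

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "price": [50, 80, 120, 200, 400],
    "review_rating": [88, 90, 91, 95, 96],
})

# Pearson measures linear association; Spearman measures monotonic association.
pearson = toy.corr(method="pearson").loc["price", "review_rating"]
spearman = toy.corr(method="spearman").loc["price", "review_rating"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

In this toy example the rating always rises with price, so the Spearman value is 1, while the Pearson value is lower because the increase is not a straight line.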

In [None]:
#try out any others that interest you

<hr>

Once we have finished the analysis, the last step is to compile all of our findings into one document. This document serves two purposes:

- It informs people in the future of our findings. That person might even be ourselves!
- It facilitates replication of the analysis by other data analysts.

Let's summarize all the work we've done!

# <font color='#eb3483'> Analysis Conclusion </font>

### <font color='#eb3483'> Description </font>

Describe the data set. What it is. Where it comes from. Maybe the time period it covers. etc.



### <font color='#eb3483'> Data dictionary </font>

The variables that exist on the dataset are:

```
* xxx (categorical)
etc.
```

### <font color='#eb3483'> Data Processing </font>

- copy from previously

### <font color='#eb3483'> Variable Exploration Description </font>
- copy from previously


### <font color='#eb3483'> Comparisons </font>

- list any insights that stood out for you here.


### <font color='#eb3483'> Conclusions </font>
- list any major conclusions


<hr>

## <font color='#eb3483'> 5. Pandas Profiling </font>


To save us some trouble, there is a very nifty package, pandas-profiling, that automates much of this EDA (newer releases are published under the name `ydata-profiling`).

You can check out the package here: https://github.com/JosPolfliet/pandas-profiling

Here is a great example of pandas-profiling in use: https://medium.com/analytics-vidhya/pandas-profiling-5ecd0b977ecd

There are also some other great options that you can read about here: https://medium.com/@pyrootml/5-most-important-tools-for-advanced-eda-exploratory-data-analysis-e2b2f60a537

With pandas-profiling we can generate a report from a pandas dataframe that provides a ton of information about the data.

In [None]:
#pip install pandas_profiling

In [None]:
#pip install markupsafe==2.0.1

In [None]:
import pandas_profiling

In [None]:
df = pd.read_csv("data/ny_airbnb_processed.csv")

In [None]:
#not going to run as it takes quite a bit of time.
report = pandas_profiling.ProfileReport(df)

In [None]:
report

We can even save the report as an HTML file, which is very useful when sharing with a colleague.

In [None]:
type(report)


In [None]:
report.to_file(output_file = 'profiling_report.html')

For more options on automated EDA, read [this](https://medium.com/@pyrootml/5-most-important-tools-for-advanced-eda-exploratory-data-analysis-e2b2f60a537).

These are good starting points for picking out trends and irregularities in your data set, but you will always understand the data better when you dig into it yourself, manually.
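
As a small sketch of the manual route, a few pandas one-liners already cover part of what an automated report shows: column types, the share of missing values, and the number of distinct values. The table below is made up for illustration.

```python
import pandas as pd

# Toy data with some missing values; invented for illustration only.
toy = pd.DataFrame({
    "price": [100, 250, None, 80],
    "room_type": ["Entire home", "Private room", "Private room", None],
})

# One row per column: its dtype, share of missing values, and distinct value count.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing_share": toy.isna().mean(),
    "n_unique": toy.nunique(),
})

print(overview)
print(toy.describe())
```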