# Descriptive Statistics Review

In this second part of the lab, we are going to continue working with the data that we cleaned in the last part. 
Be sure to continue to write clean code and comment your work well!

First, let's import our libraries and the data we saved.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
diamonds = pd.read_csv('diamonds_clean.csv')
diamonds = diamonds.drop('Unnamed: 0', axis=1)

Now that we have cleaned our data, we can proceed with some exploratory analysis. We will analyze the features that affect price the most.

Let's start by looking at how the characteristics of a diamond (especially the price, since that's our focus) change based on its color. Remember that you can use the `groupby()` method in pandas.
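
As a reminder of how `groupby()` behaves, here is a minimal sketch on a toy table. The values below are invented for illustration and are not taken from the lab's dataset; only the column names mirror it.

```python
import pandas as pd

# Toy data, invented for illustration only (not the lab's diamonds file)
toy = pd.DataFrame({
    "color": [1, 1, 2, 2, 2],
    "carat": [0.5, 0.7, 1.0, 1.2, 1.4],
    "price": [1000, 1400, 3000, 3600, 4200],
})

# One row per color, holding the mean of each selected column.
# Note the double brackets: a list of column names is passed to the selection.
means_by_color = toy.groupby("color")[["carat", "price"]].mean()
print(means_by_color)
```

Selecting the columns with a list (double brackets) is the form recent pandas versions expect when more than one column is requested.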

**Using the `describe()` method, take a look at the dataset, paying special attention to the variability. Comment on what you see.**

In [None]:
#your code here

diamonds.describe().transpose()

In [None]:
#your comments here

# Some of the diamonds seem to be extremely expensive compared to the rest.

Let's proceed to check each feature separately. 

**Before starting, which features do you think will affect the price most, and why? You will compare your hypotheses with your results.**

In [None]:
#your hypotheses here

# It seems to me that size is the biggest indicator of price. Color and clarity seem to increase with price but don't show extreme outliers. For size, the 75th percentile looks reasonable, but the maximum is much bigger and probably explains the high prices.

## 1. The `color` column
First, let's look at the color column.

**For each color, find the mean of each column. You should get a table with one row per color and `carat`, `clarity`, etc. as columns.**

In [None]:
#your code here

# Mean carat, clarity and price for each color (columns selected with a list)
diamonds.groupby('color', as_index=False)[['carat', 'clarity', 'price']].mean()

**What do you see? Pay special attention to the relationship between price and color.**

In [None]:
#your thoughts here

# The better the color the lower the carat and price become.

Let's go further into the color feature. We will plot the frequency distribution of the diamond colors in our dataset.

**Plot the distribution and analyze it. Remember that you can use the pandas `plot()` method.**
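
A frequency distribution counts how many rows fall into each category. The sketch below shows one possible way to get those counts with pandas, assuming `color` is stored as an integer code; the toy values are invented for illustration.

```python
import pandas as pd

# Toy color codes, invented for illustration only
toy = pd.DataFrame({"color": [1, 2, 2, 3, 3, 3, 4]})

# Count the rows per color, then order the result by color code
counts = toy["color"].value_counts().sort_index()
print(counts)
```

Calling `counts.plot(kind="bar")` on the result would draw these counts as a bar chart.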

In [None]:
#your code here

import plotly.express as px

px.box(diamonds, y="color")

In [None]:
#your comments here

# It seems that most values are between 2 and 5.

## 2. The `carat` column

Let's check the `carat` (weight), since this could also be a potential factor for price change.

**Find the mean of each column for each value of `carat` using the `groupby` method. Then comment on your results.**

In [None]:
#your code here

# Mean color, clarity and price for each carat value
diamonds.groupby('carat', as_index=False)[['color', 'clarity', 'price']].mean()

In [None]:
#your comments









**Plot a histogram of the `carat` column by using the `plot` method (see the docs to find an easy way to do so). What is happening?**

In [None]:
#your code here

px.histogram(diamonds, x="carat")

## 3. The `table` and `clarity` columns
Finally, let's check the `table` column.

**Find the mean of each column for each value of `table` using the `groupby` method. Then comment on your results.**

In [None]:
#your code here

# Mean color, carat, clarity and price for each table value
diamonds.groupby('table', as_index=False)[['color', 'carat', 'clarity', 'price']].mean()

In [None]:
#your comments here






**Finally, do the same with the `clarity` column.**

In [None]:
#your code here

# Mean color, carat, table and price for each clarity grade
diamonds.groupby('clarity', as_index=False)[['color', 'carat', 'table', 'price']].mean()

In [None]:
#your comments here








**After looking at your results, which features do you think will affect price the most now? Regarding your hypotheses, do they match your final results? Provide a small overview.**

In [None]:
#your thoughts here












# Bonus: taking a deeper look with plots and correlations

To take a deeper look, we will use the `pairplot` function of the `seaborn` library. It draws a scatterplot for each pair of features and, on the diagonal, the distribution of each feature.

Be careful: with many features, this can take a while to run!


In [None]:
#Run this code
import seaborn as sns
sns.pairplot(diamonds, vars=['carat', 'color', 'clarity', 'price'], diag_kind = 'kde', plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'});

**What do you see here? What relationships between variables are the most interesting?**

In [None]:
#your thoughts here









Now we will look at a plot of the correlation matrix. As you know, a higher correlation means that the feature could be a cause of the changes in price (**but this is not certain**).

The matrix is drawn with colors. A lighter color means a higher correlation.

This is done with the `seaborn` library as well.
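
To make the idea concrete, here is a small sketch of a correlation matrix computed with pandas on a toy table. The numbers are invented for illustration; only the column names are borrowed from the lab.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5],
    "price": [400, 900, 1500, 3000, 6000],
    "table": [58, 57, 60, 56, 59],
})

# Pearson correlation between every pair of columns
corr = toy.corr()

# Correlation of each feature with price, strongest first
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

The matrix is symmetric and has ones on the diagonal, so only one triangle carries information. Reading a single column, as done here for `price`, is often the quickest way to rank features.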

In [None]:
#Run this code
plt.figure(figsize=(20, 20))
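# numeric_only=True skips any non-numeric columns (required by recent pandas versions)
p = sns.heatmap(diamonds.corr(numeric_only=True), annot=True, square=True)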

**What do you see here? Does it fit with your earlier results?**

In [None]:
#your thoughts here











Finally, we will fit a linear regression of the price on the weight. We will first plot it with the `seaborn` library and then compute the coefficient of determination (R²) with the `scipy` library.
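
For intuition about what R² measures, the sketch below computes a least-squares line and its R² by hand in plain Python. The helper names `fit_line` and `r_squared` are hypothetical, and the toy values are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys):
    """Share of the variance in ys explained by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy data, invented for illustration only
carat = [0.3, 0.5, 0.7, 1.0, 1.5]
price = [400, 900, 1500, 3000, 6000]
print(r_squared(carat, price))
```

For a simple regression with one predictor, this value equals the square of the Pearson correlation coefficient. A value near 1 means the line explains most of the variation in price, while a value near 0 means it explains very little.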

In [None]:
#Run this code
plt.figure(figsize=(10, 10))
# x and y are passed as keywords, as recent seaborn versions require
sns.regplot(x=diamonds.carat, y=diamonds.price, scatter=True);


In [None]:
#Run this code
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(diamonds.carat, diamonds.price)
r2 = r_value ** 2
r2

**What do you think?**

In [None]:
#your thoughts here










**Would you do any other checks on other features? Do you have any comments regarding `carat`?**

In [None]:
#your thoughts here













**Conclusion**

**From our dataset** we can conclude that although `color` and `clarity` are graded, and thus carry an assigned importance or weight, they do not influence the monetary value of a diamond in a decisive way. While it is true that different colors or clarities may have different prices, on closer examination those variations in price seem to be linked to `carat` (weight) and the diamond's `dimensions`. In our analysis, the key factors in determining a diamond's value were these two features, since both the correlation coefficients and the coefficient of determination show that they are closely related to price.