# Descriptive Statistics Review

## Before you start:

- Read the README.md file
- Comment as much as you can
- Happy learning!

## Context

![img](./diamonds.jpg)

In this lab we are going to work with data to understand the characteristics of a diamond that are most likely to influence its price. In this first part of the lab, we will explore and clean our data. 

The dataset we will be using consists of approximately 54k rows and 11 columns (10 features plus an index column that we will drop). As always, a row represents a single observation (in this case a diamond) and each of the columns represents a different feature of a diamond.

The following codebook was provided together with the dataset to clarify what each column represents:


| Column  | Description  |
|---|---|
| Price  | Price in US dollars (326-18,823)  |
| Carat  | Weight of the diamond (0.2--5.01)  |
| Cut  | Quality of the cut (Fair, Good, Very Good, Premium, Ideal)  |
| Color  | Diamond colour, from J (worst) to D (best)  |
| Clarity  | A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))   |
| x  | Length in mm (0--10.74)  |
| y  | Width in mm (0--58.9)  |
| z  | Depth in mm (0--31.8)  |
| Depth  | Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)  |
| Table  | Width of top of diamond relative to widest point (43--95)  |

## Libraries
Pandas and numpy will be needed for the analysis of the data. Don't worry about the seaborn and matplotlib import at the moment, you will learn more about them next week, but we will be using some of their functionalities.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

First, import the data from the .csv file provided, assign it to a variable named `diamonds`, and drop the index column.

In [2]:
diamonds = pd.read_csv('diamonds.csv')
diamonds = diamonds.drop('Unnamed: 0', axis=1)
diamonds

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


# 1. Taking the first look at the data.
Let's see how the data looks by using pandas methods like `head()`, `info()` and `describe()`. 

**First, use the `head` method.**

In [3]:
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [4]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


In [5]:
diamonds.describe()

Unnamed: 0,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8


We can see the first 5 rows of the dataset using the `head` method. This by itself doesn't tell us much about the data that we have, but we can have a first look at the features (columns) and some of the values that each one takes.

**What do you see? Make some comments about the values you see in each column, comparing them with the codebook. Is that what you would expect for these variables?**

In [6]:
diamonds.head(60)
#The values match what I would expect from the descriptions in the codebook. However,
#I would suggest encoding cut, color and clarity as numeric categories to make them easier to compare.

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
5,0.24,Very Good,J,VVS2,62.8,57.0,336,3.94,3.96,2.48
6,0.24,Very Good,I,VVS1,62.3,57.0,336,3.95,3.98,2.47
7,0.26,Very Good,H,SI1,61.9,55.0,337,4.07,4.11,2.53
8,0.22,Fair,E,VS2,65.1,61.0,337,3.87,3.78,2.49
9,0.23,Very Good,H,VS1,59.4,61.0,338,4.0,4.05,2.39


It is very important to know the amount of data we have, because everything will depend on that, from the quality of the analysis to the choice of our infrastructure.

**Check the shape of the data**

In [7]:
diamonds.shape

(53940, 10)

The `clarity` column is confusing because we are not diamond experts. Let's create a new column with a new scale that is more understandable for us.

**Create a new column with numbers from 0 to 7. The lowest would be 0 with value `I1` and the greatest 7 with value `IF`**

In [8]:
#A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

diamonds.loc[diamonds['clarity'] == 'I1', 'clarity_num'] = 0
diamonds.loc[diamonds['clarity'] == 'SI2', 'clarity_num'] = 1
diamonds.loc[diamonds['clarity'] == 'SI1', 'clarity_num'] = 2
diamonds.loc[diamonds['clarity'] == 'VS2', 'clarity_num'] = 3
diamonds.loc[diamonds['clarity'] == 'VS1', 'clarity_num'] = 4
diamonds.loc[diamonds['clarity'] == 'VVS2', 'clarity_num'] = 5
diamonds.loc[diamonds['clarity'] == 'VVS1', 'clarity_num'] = 6
diamonds.loc[diamonds['clarity'] == 'IF', 'clarity_num'] = 7
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,clarity_num
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43,1.0
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31,2.0
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31,4.0
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63,3.0
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75,1.0
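
The eight `.loc` assignments above can also be expressed as a single dictionary lookup. The following is a minimal sketch of that alternative, assuming `Series.map` with a plain dict; `sample` and `clarity_scale` are hypothetical names and the rows are made up for illustration, not taken from the lab data.

```python
import pandas as pd

# Labels ordered from worst to best, as listed in the codebook.
clarity_order = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]
clarity_scale = {label: rank for rank, label in enumerate(clarity_order)}

# Hypothetical sample standing in for the diamonds DataFrame.
sample = pd.DataFrame({"clarity": ["SI2", "SI1", "VS1", "IF", "I1"]})
sample["clarity_num"] = sample["clarity"].map(clarity_scale)
print(sample)
```

When every label is present in the dictionary, the new column comes out as integers, so no later type conversion is needed. The same pattern should work for `color` with the order J, I, H, G, F, E, D.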


It makes sense to do the same with the `color` column.

**Do the same with values from 0 to 6. Read the codebook to see the match**

In [9]:
#diamond colour, from J (worst) to D (best)
#J=0, I=1, H=2, G=3, F=4, E=5, D=6

diamonds.loc[diamonds['color'] == 'J', 'color_num'] = 0
diamonds.loc[diamonds['color'] == 'I', 'color_num'] = 1
diamonds.loc[diamonds['color'] == 'H', 'color_num'] = 2
diamonds.loc[diamonds['color'] == 'G', 'color_num'] = 3
diamonds.loc[diamonds['color'] == 'F', 'color_num'] = 4
diamonds.loc[diamonds['color'] == 'E', 'color_num'] = 5
diamonds.loc[diamonds['color'] == 'D', 'color_num'] = 6
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,clarity_num,color_num
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43,1.0,5.0
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31,2.0,5.0
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31,4.0,5.0
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63,3.0,1.0
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75,1.0,0.0


With the `info` method, we can see the features of the dataset, the number of observations (rows) that have a non-null value, and the types of the features. 

**Now use the `info` method and, comparing it with the shape, comment on what you see**

In [10]:
diamonds.info()
#The info method gives us more detailed information, such as column names and types, whereas the shape attribute only gives us the number of rows and columns.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   carat        53940 non-null  float64
 1   cut          53940 non-null  object 
 2   color        53940 non-null  object 
 3   clarity      53940 non-null  object 
 4   depth        53940 non-null  float64
 5   table        53940 non-null  float64
 6   price        53940 non-null  int64  
 7   x            53940 non-null  float64
 8   y            53940 non-null  float64
 9   z            53940 non-null  float64
 10  clarity_num  53940 non-null  float64
 11  color_num    53940 non-null  float64
dtypes: float64(8), int64(1), object(3)
memory usage: 4.9+ MB


In [11]:
diamonds.shape

(53940, 12)

In the last line of the info output, you have some information about the types of the columns. As you know, it is a good idea to check that the type of each column is what you expect. If a column has the right type, we will be able to do all the operations that we want to do. 

For instance, if we have a column that is a `date` stored in a `string` format, we will have the data but we won't be able to do a simple operation, such as formatting the date the way that we would like.

Changing the data type to the one we need can help us solve a lot of problems in our data.
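
To illustrate the date example, here is a small sketch that assumes a hypothetical `sales` frame with a `sold_on` column (neither is part of the diamonds data). It converts the strings with `pd.to_datetime`, after which the `.dt` accessor becomes available.

```python
import pandas as pd

# Hypothetical column of dates stored as plain strings.
sales = pd.DataFrame({"sold_on": ["2020-01-15", "2020-02-03"]})

# After conversion, the .dt accessor exposes date operations.
sales["sold_on"] = pd.to_datetime(sales["sold_on"])
sales["sold_month"] = sales["sold_on"].dt.strftime("%Y/%m")
print(sales)
```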

**Check the type of each column and comment on whether it matches what you expected**

In [12]:
diamonds.dtypes

carat          float64
cut             object
color           object
clarity         object
depth          float64
table          float64
price            int64
x              float64
y              float64
z              float64
clarity_num    float64
color_num      float64
dtype: object

In [13]:
#The types of the columns are what I would expect. However, I would change the new columns clarity_num and color_num to integers instead of floats.
diamonds['clarity_num'] = diamonds['clarity_num'].astype('int64')
diamonds['clarity_num'].dtype

dtype('int64')

In [14]:
diamonds['color_num'] = diamonds['color_num'].astype('int64')
diamonds['color_num'].dtype

dtype('int64')

# 2. A deeper look: checking the basic statistics.

The `describe` method gives us an overview of our data. From here we can see all the descriptive metrics for our variables.

**Use the `describe` method and comment on what you see**

In [15]:
diamonds.describe()
#We see that columns x, y and z have a minimum value of 0

Unnamed: 0,carat,depth,table,price,x,y,z,clarity_num,color_num
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734,3.05102,3.405803
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699,1.647136,1.701105
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91,2.0,2.0
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53,3.0,3.0
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04,4.0,5.0
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8,7.0,6.0


You have probably noticed that the columns x, y and z have a minimum value of 0. This means that there are one or more rows (or observations) in our dataset that are supposedly representing a diamond that has a length, width or depth of 0. Considering that we're talking about a physical object, this is impossible!

Now let's proceed to check the rows that have a value of 0 in any of the x, y or z columns. By doing this we want to check if the data we are missing can be obtained using the data that we do have.

**Check the rows that have a value of 0 in any of the `x`, `y` and `z` columns and comment on what you see**

In [16]:
zeros = diamonds[(diamonds['x']==0) | (diamonds['y']==0) | (diamonds['z']==0)]
zeros

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,clarity_num,color_num
2207,1.0,Premium,G,SI2,59.1,59.0,3142,6.55,6.48,0.0,1,3
2314,1.01,Premium,H,I1,58.1,59.0,3167,6.66,6.6,0.0,0,2
4791,1.1,Premium,G,SI2,63.0,59.0,3696,6.5,6.47,0.0,1,3
5471,1.01,Premium,F,SI2,59.2,58.0,3837,6.5,6.47,0.0,1,4
10167,1.5,Good,G,I1,64.0,61.0,4731,7.15,7.04,0.0,0,3
11182,1.07,Ideal,F,SI2,61.6,56.0,4954,0.0,6.62,0.0,1,4
11963,1.0,Very Good,H,VS2,63.3,53.0,5139,0.0,0.0,0.0,3,2
13601,1.15,Ideal,G,VS2,59.2,56.0,5564,6.88,6.83,0.0,3,3
15951,1.14,Fair,G,VS1,57.5,67.0,6381,0.0,0.0,0.0,4,3
24394,2.18,Premium,H,SI2,59.4,61.0,12631,8.49,8.45,0.0,1,2


As you can see, we have 20 rows that have a value of 0 in some or all of the aforementioned columns.
Most of them (12) are missing only the z value, which we can obtain using the columns depth, x and y. 

20 rows with issues represent just 0.04% of our data (20 out of 53940), so it wouldn't be a big deal to remove them. Still, let's try to keep all the data we have. 

For those 12 rows, we will create a function that applies the formula given in the codebook to get the value of z. We will drop the other rows (8), since they are missing all 3 values or 2 of them.
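
For reference, the codebook formula can be solved for z as follows. The factor of 100 is an assumption inferred from the 43--79 range of `depth` (a percentage); the codebook formula itself omits it.

$$
\text{depth} = 100 \cdot \frac{2z}{x + y}
\quad\Longrightarrow\quad
z = \frac{\text{depth}}{100} \cdot \frac{x + y}{2} = \frac{\text{depth}\,(x + y)}{200}
$$

For row 2207 (depth 59.1, x 6.55, y 6.48) this gives $z = 59.1 \times 13.03 / 200 \approx 3.85$.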

**Create a function named `calculate_z` that applies the formula in the codebook to a single row that you pass to the function**

In [24]:
import math
#get the value of z using the columns depth, x and y
#z = (depth/100) * (x + y) / 2

#My first try
#def calculate_z(x, y, d):
    #z = ((d/100)*(x + y))/2
    #return z

#calculate_z(6.55,6.48,59.1)

#def calculate_z(line):
    #return ((line.depth/100)*(line.x + line.y))/2

#Colleagues helped me with these functions

#Other possible functions from colleagues
#def truncate (a, d):
    #return math.floor(a*10**d)/(10**d)

#def calculate_z(row):
    #x = row['x']
    #y = row['y']
    #depth = row['depth']
    
    #if (x!=0) and (y!=0):
        #z= ((x+y)/2)*(depth/100)
    #else:
        #z=0
    #return truncate(z,2)

def calculate_z(row):
    #rebuild z only when it is missing (0) and both x and y are available
    if row.z == 0.0 and row.x != 0.0 and row.y != 0.0:
        return row.depth * (row.x + row.y) / 200
    return row.z


In [25]:
zeros = zeros.copy()  #work on an explicit copy to avoid the SettingWithCopyWarning
zeros["z"] = zeros.apply(calculate_z, axis=1)
zeros


Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,clarity_num,color_num
2207,1.0,Premium,G,SI2,59.1,59.0,3142,6.55,6.48,3.85,1,3
2314,1.01,Premium,H,I1,58.1,59.0,3167,6.66,6.6,3.85,0,2
4791,1.1,Premium,G,SI2,63.0,59.0,3696,6.5,6.47,4.08,1,3
5471,1.01,Premium,F,SI2,59.2,58.0,3837,6.5,6.47,3.83,1,4
10167,1.5,Good,G,I1,64.0,61.0,4731,7.15,7.04,4.54,0,3
11182,1.07,Ideal,F,SI2,61.6,56.0,4954,0.0,6.62,0.0,1,4
11963,1.0,Very Good,H,VS2,63.3,53.0,5139,0.0,0.0,0.0,3,2
13601,1.15,Ideal,G,VS2,59.2,56.0,5564,6.88,6.83,4.05,3,3
15951,1.14,Fair,G,VS1,57.5,67.0,6381,0.0,0.0,0.0,4,3
24394,2.18,Premium,H,SI2,59.4,61.0,12631,8.49,8.45,5.03,1,2


**Apply it just to the rows with incorrect values**

In [26]:
#apply calculate_z only to the rows where z is 0; all other rows are left untouched
zero_z = diamonds['z'] == 0
diamonds.loc[zero_z, 'z'] = diamonds[zero_z].apply(calculate_z, axis=1)
diamonds

#rows that also lack x or y keep z = 0 and are handled in the next step

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,clarity_num,color_num
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43,1,5
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31,2,5
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31,4,5
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63,3,1
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50,2,6
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61,2,6
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56,2,6
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74,1,2


If we leave the other 8 values as they are, they will negatively affect our analysis, because these data do not make logical sense. Therefore it is better to treat those values as NaN, since they are probably the result of a mistake or error during the process of measuring and storing these values in a dataset.

To replace them we can use the pandas `.replace()` method and `np.nan`.

**Replace the zero values in the `z` column with NaN**

In [29]:
diamonds['z'] = diamonds['z'].replace(to_replace=0.0, value=np.nan)  #np.NaN was removed in NumPy 2.0
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   carat        53940 non-null  float64
 1   cut          53940 non-null  object 
 2   color        53940 non-null  object 
 3   clarity      53940 non-null  object 
 4   depth        53940 non-null  float64
 5   table        53940 non-null  float64
 6   price        53940 non-null  int64  
 7   x            53940 non-null  float64
 8   y            53940 non-null  float64
 9   z            53932 non-null  float64
 10  clarity_num  53940 non-null  int64  
 11  color_num    53940 non-null  int64  
dtypes: float64(6), int64(3), object(3)
memory usage: 4.9+ MB


----
# Bonus: check the new z values
Since we need to be 100% sure of our data, let's create a function that validates our z. To do so, we will use the same formula, but this time we will calculate the value of depth with the new value assigned to z.

**Create a function named `validate_z` that compares the `z` computed in the cells above with the one given by the formula, and run it on the rows you changed in the cells above**

In [None]:
#your code here
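
One possible shape for such a check is sketched below. It recomputes `depth` from `x`, `y` and `z` and compares it with the stored value. The `tolerance` parameter and the `fixed_row` example are assumptions introduced here for illustration; the sketch also assumes `x + y` is not zero.

```python
import math
import pandas as pd

def validate_z(row, tolerance=0.1):
    """Return True if depth recomputed from x, y and z matches the stored depth.

    `tolerance` (in percentage points) is a hypothetical threshold that
    absorbs rounding in the stored measurements.
    """
    recomputed_depth = 200 * row["z"] / (row["x"] + row["y"])
    return math.isclose(recomputed_depth, row["depth"], abs_tol=tolerance)

# Values from row 2207 after z was filled in (3.85 is the rounded display value).
fixed_row = pd.Series({"x": 6.55, "y": 6.48, "z": 3.85, "depth": 59.1})
print(validate_z(fixed_row))
```

On the full DataFrame, the same function could be applied with `apply(validate_z, axis=1)` to the rows that were changed.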

Let's check the data again with the `describe()` method.

The minimum value for x, y and z should now be a positive number, as it should be for the physical measurements of an object.

Let's finish by checking for NaN values in the data. Since we introduced them ourselves using `replace`, we will surely find some, but there may be more that are unrelated to the x, y and z columns. Checking for NaNs is a fundamental part of data cleaning, and it's always better to do this kind of operation before proceeding with the analysis.

**Check how many NaNs you have, comment on what you would do with those values, and then do so**

In [None]:
#your code here
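
As a rough sketch of one way to approach this, the example below counts NaNs per column and then drops the affected rows. Dropping is only one option, chosen here because so few rows are involved; `sample` and `cleaned` are hypothetical names and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing z values.
sample = pd.DataFrame({
    "x": [3.95, 0.0, 6.55, 0.0],
    "y": [3.98, 6.62, 6.48, 0.0],
    "z": [2.43, np.nan, 3.85, np.nan],
})

print(sample.isna().sum())   # NaNs per column

# Drop the rows where z is missing and rebuild a clean index.
cleaned = sample.dropna(subset=["z"]).reset_index(drop=True)
print(cleaned.shape)
```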

# 3. Checking for outliers
Now we are going to revisit the summary table to check for outliers.

**Use the `describe` method again and comment on what you see. After that, check if you have any outliers** 

In [None]:
#your code here

In [None]:
#your comments here

To manage these outliers, we are going to filter our DataFrame and take all the rows that have a price higher than the 75th percentile.

**Look for that quantile and filter the dataframe to clearly see the outliers. What do you think?**

In [None]:
#your code here
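
A minimal sketch of this kind of filter, assuming `Series.quantile` and a boolean mask, could look like the following. The `sample` frame and its prices are hypothetical stand-ins for `diamonds["price"]`.

```python
import pandas as pd

# Hypothetical prices; the real column is diamonds["price"].
sample = pd.DataFrame({"price": [326, 950, 2401, 5324, 18823]})

q75 = sample["price"].quantile(0.75)
expensive = sample[sample["price"] > q75]
print(q75)
print(expensive)
```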

Our dataset is really big and the outliers are really far apart from the rest of the values. To see this more clearly we will use a boxplot, which plots the median, the 25th and 75th percentiles (the first and third quartiles), the maximum and minimum of the non-outlying values, as well as any outliers.

In [None]:
#Run this code
fig, ax = plt.subplots(1,2, figsize=(10, 5))
sns.boxplot(y=diamonds.y, ax=ax[0])
sns.boxplot(y=diamonds.z, ax=ax[1])
plt.subplots_adjust(wspace=0.5)
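
The points a boxplot draws as outliers can also be listed numerically. By default, matplotlib and seaborn draw the whiskers out to the last point within 1.5 times the interquartile range (IQR), so the sketch below applies that same rule. The helper `iqr_outliers` and the sample measurements are hypothetical, not part of the lab.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

# Hypothetical y measurements with one implausible entry.
y = pd.Series([3.98, 4.72, 5.71, 6.54, 6.62, 58.9])
print(iqr_outliers(y))
```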

Now we can see that most of the values are within an acceptable range, but we have 2 big outliers in y and 1 in z. We now know that the maximum values for y should be around 10 and the values for z should be around 6, so let's filter our dataset to find values of `z` higher than 10.


In [None]:
#your code here

Now that we have found the outlier, let's use the function we defined earlier to correct this value. First, we need to change the value to 0 (because that's how we defined the function before) and then we will apply it.

**Apply `calculate_z` for the row with the outlier**

In [None]:
#your code here
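
The two steps described above (reset the outlier to 0, then reapply the function) might be sketched as follows. The two-row `sample` frame is hypothetical, and `calculate_z` is restated with bracket access only so that the block runs on its own.

```python
import pandas as pd

def calculate_z(row):
    # Same logic as the function defined earlier in this notebook.
    if row["z"] == 0 and row["x"] != 0 and row["y"] != 0:
        return row["depth"] * (row["x"] + row["y"]) / 200
    return row["z"]

# Hypothetical two-row frame; index 1 plays the role of the outlier row.
sample = pd.DataFrame({
    "depth": [61.5, 61.8],
    "x": [3.95, 5.12],
    "y": [3.98, 5.15],
    "z": [2.43, 31.8],
})

outlier_idx = sample.index[sample["z"] > 10]
sample.loc[outlier_idx, "z"] = 0.0   # mark as missing, as calculate_z expects
sample.loc[outlier_idx, "z"] = sample.loc[outlier_idx].apply(calculate_z, axis=1)
print(sample)
```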

Let's check if we actually corrected the outlier.

In [None]:
diamonds.loc[48410]

Cool! Now let's validate our new `z`. We will check if we obtain the same value of depth using our validate function. If the formula applies, this means we could approximate the real value of `z`.

**Apply `validate_z` to the row used earlier**

In [None]:
#your code here

Now let's do the same for `y`. First, let's filter the DataFrame to find the outliers. We said that the maximum values should be around 10, so let's check which values are above 10.

**Check the values greater than 10 in the `y` column** 

In [None]:
#your code here

We can clearly see that the 31.8 in row 49189 is an outlier for the y value. Also, we can see that the 58.9 value for `y` in row 24067 is actually its depth, so it was probably a mistake made when the data were entered. Let's create a function to fix these outliers.

**Create a function named `calculate_y` to calculate `y` using `z` and `x` the same way you did above**

In [None]:
#your code here
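
Solving the depth formula for y gives y = 200 * z / depth - x. The sketch below follows the same convention as `calculate_z`, treating a value of 0 as "missing"; that convention and the `flagged` example row are assumptions made for illustration.

```python
import pandas as pd

def calculate_y(row):
    """Rebuild y from depth, x and z when y has been zeroed out."""
    if row["y"] == 0 and row["x"] != 0 and row["z"] != 0:
        return 200 * row["z"] / row["depth"] - row["x"]
    return row["y"]

# Hypothetical row whose y was set to 0 after being flagged as an outlier.
flagged = pd.Series({"depth": 60.0, "x": 5.0, "y": 0.0, "z": 3.0})
print(calculate_y(flagged))
```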

We will check the rows that had an outlier in `y`, to check that the values were changed.

**Check those rows (also validating with your function) and comment what you see**

Now that we have corrected or dropped all of our outliers, let's plot another box plot to double check.

In [None]:
#Run this code
fig, ax = plt.subplots(1,2, figsize=(10, 5))
sns.boxplot(y=diamonds.y, ax=ax[0])
sns.boxplot(y=diamonds.z, ax=ax[1])
plt.subplots_adjust(wspace=0.5)

**What do you think? Are these values more reasonable?**


In [None]:
#your thoughts here

**Once you are happy with your cleaning, save the cleaned data to a csv file. Your new csv should be named ``diamonds_clean``**

In [31]:
diamonds.to_csv('diamonds_clean.csv', index=False)