# <center> Module 2: Lesson 2 Data Exploration </center>
##  <center>Sample Data Exploration <br> with Pandas Library Functions </center>
<center>by: Nicole Woodland, P. Eng. for RoboGarden Inc. </center>

---

This notebook demonstrates several ways to explore a new dataset. The techniques covered include:
- Using summary statistics.
- Calculating standard deviation and variance with pandas.
- Arranging data in NumPy arrays.
- Calculating similarity/dissimilarity.

Dataset: https://www.kaggle.com/datasets/crawford/80-cereals
Note: Modifications have been made to the original dataset to showcase the different functions.

In [3]:
import pandas as pd
df = pd.read_csv("../cereal_mod.csv")

In [13]:
df.head(5)

Unnamed: 0,name,mfr,type,calories,protein,fat,fat.1,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70.0,4,1,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120.0,3,5,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,,70.0,4,1,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,,50.0,4,0,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,,2,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


In [7]:
# See a summary table of the data columns & Null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 0 to 77
Data columns (total 17 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      78 non-null     object 
 1   mfr       78 non-null     object 
 2   type      75 non-null     object 
 3   calories  77 non-null     float64
 4   protein   78 non-null     int64  
 5   fat       78 non-null     int64  
 6   fat.1     78 non-null     int64  
 7   sodium    78 non-null     int64  
 8   fiber     78 non-null     float64
 9   carbo     78 non-null     float64
 10  sugars    78 non-null     int64  
 11  potass    78 non-null     int64  
 12  vitamins  78 non-null     int64  
 13  shelf     78 non-null     int64  
 14  weight    78 non-null     float64
 15  cups      78 non-null     float64
 16  rating    78 non-null     float64
dtypes: float64(6), int64(8), object(3)
memory usage: 10.5+ KB
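
The `df.info()` output shows nulls only indirectly (75 non-null values for `type`, 77 for `calories`). A minimal sketch of counting them directly is below. It uses a small toy frame rather than the real file, and it assumes that the `-1` seen in `potass` for Almond Delight is a placeholder for "unknown" rather than a real measurement; that assumption is not confirmed by the dataset notes.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few cereal columns (illustrative values only)
toy = pd.DataFrame({
    "type": ["C", "C", None, None, "C"],
    "calories": [70.0, 120.0, 70.0, 50.0, np.nan],
    "potass": [280, 135, 320, 330, -1],
})

# Count explicit nulls per column
null_counts = toy.isna().sum()
print(null_counts)

# Assumption: -1 is a sentinel for a missing value, so mask it out as NaN
potass_clean = toy["potass"].mask(toy["potass"] == -1)
print(potass_clean.isna().sum())
```

Masking the sentinel first keeps it from dragging down statistics such as the mean and minimum.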


In [9]:
# Get table summary statistics
df.describe()

Unnamed: 0,calories,protein,fat,fat.1,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
count,77.0,78.0,78.0,78.0,78.0,78.0,78.0,78.0,78.0,78.0,78.0,78.0,78.0,78.0
mean,106.753247,2.538462,1.0,1.0,161.346154,2.137179,14.679487,6.858974,95.294872,28.205128,2.192308,1.029231,0.823333,42.7067
std,19.496394,1.08941,1.006473,1.006473,84.583291,2.371427,4.312451,4.450957,71.159252,22.200011,0.83833,0.149534,0.232086,13.96047
min,50.0,1.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,1.0,0.5,0.25,18.042851
25%,100.0,2.0,0.0,0.0,131.25,1.0,12.0,3.0,40.0,25.0,1.0,1.0,0.67,33.37649
50%,110.0,2.5,1.0,1.0,180.0,1.75,14.5,6.5,90.0,25.0,2.0,1.0,0.75,40.42449
75%,110.0,3.0,1.75,1.75,217.5,3.0,17.0,11.0,120.0,25.0,3.0,1.0,1.0,50.812544
max,160.0,6.0,5.0,5.0,320.0,14.0,23.0,15.0,330.0,100.0,3.0,1.5,1.5,93.704912
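
In the table above, `fat` and `fat.1` have identical statistics in every row, which suggests one column is a copy of the other. The sketch below shows one possible way to confirm and drop such a duplicate. It runs on a toy frame with made-up values, not on the cereal file itself.

```python
import pandas as pd

# Toy frame with one deliberately duplicated column (illustrative values only)
toy = pd.DataFrame({
    "fat": [1, 5, 1, 0, 2],
    "fat.1": [1, 5, 1, 0, 2],
    "sugars": [6, 8, 5, 0, 8],
})

# True when both columns hold the same values in the same order
is_duplicate = toy["fat"].equals(toy["fat.1"])

# Transposing lets duplicated() compare columns as rows; keep the first copy
deduped = toy.loc[:, ~toy.T.duplicated()]
print(is_duplicate, list(deduped.columns))
```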


### View summary statistics of numerical and categorical data
Numerical:

In [15]:
df['calories'].median()

110.0

In [21]:
df['fiber'].mean()

2.137179487179487

In [19]:
# Standard deviation (pandas uses the sample formula, ddof=1, by default)
df['calories'].std()

19.496393623434326

![image.png](attachment:3ff55c54-27e4-4784-8b92-62c0957b4e6e.png)

In [24]:
df['calories'].var()

380.1093643198906
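
Variance is the square of the standard deviation, which is why 19.4964 squared gives roughly 380.11 above. The sketch below checks that relationship on a small made-up series and computes the sample variance by hand. The values are illustrative, not taken from the full dataset.

```python
import numpy as np
import pandas as pd

values = pd.Series([70.0, 120.0, 70.0, 50.0, 110.0])

sample_std = values.std()   # pandas default: ddof=1 (divide by n - 1)
sample_var = values.var()   # same default

# Sample variance by hand: squared deviations summed, divided by n - 1
by_hand = ((values - values.mean()) ** 2).sum() / (len(values) - 1)

# NumPy defaults to the population formula (ddof=0), so it is slightly smaller
population_std = np.std(values.to_numpy())

print(sample_var, by_hand, sample_std ** 2, population_std)
```

The difference between the pandas and NumPy defaults is worth remembering when comparing results across the two libraries.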

### Correlation:
How strongly pairs of variables relate to each other linearly, on a scale from -1 to 1:
- -1: Perfect negative correlation. The variables move in opposite directions (i.e., when one variable increases, the other decreases).
- 0: No linear correlation. The variables show no linear relationship with each other.
- 1: Perfect positive correlation. The variables move in the same direction (i.e., when one variable increases, the other also increases).


![Correlation](https://upload.wikimedia.org/wikipedia/commons/3/34/Correlation_coefficient.png)
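
The three reference values can be reproduced with tiny made-up series. This is a sketch for intuition only; the last case also shows that a correlation of 0 rules out a linear relationship, not every relationship.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

r_pos = x.corr(2 * x + 1)       # straight line going up
r_neg = x.corr(-3 * x)          # straight line going down
r_zero = x.corr((x - 3) ** 2)   # symmetric U shape: related, but not linearly

print(round(r_pos, 6), round(r_neg, 6), round(r_zero, 6))
```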

In [26]:
df.corr(numeric_only=True)

Unnamed: 0,calories,protein,fat,fat.1,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
calories,1.0,0.022417,0.500734,0.500734,0.288234,-0.289966,0.240576,0.562287,-0.060124,0.26617,0.100956,0.696981,0.084042,-0.690071
protein,0.022417,1.0,0.213201,0.213201,-0.06364,0.501887,-0.138324,-0.318928,0.551437,0.008261,0.141107,0.217028,-0.248094,0.468233
fat,0.500734,0.213201,1.0,1.0,-0.025171,0.022853,-0.330632,0.281207,0.202186,-0.029062,0.277054,0.215728,-0.184029,-0.409436
fat.1,0.500734,0.213201,1.0,1.0,-0.025171,0.022853,-0.330632,0.281207,0.202186,-0.029062,0.277054,0.215728,-0.184029,-0.409436
sodium,0.288234,-0.06364,-0.025171,-0.025171,1.0,-0.079081,0.374864,0.077265,-0.048907,0.352995,-0.09619,0.299856,0.132611,-0.390484
fiber,-0.289966,0.501887,0.022853,0.022853,-0.079081,1.0,-0.35973,-0.132995,0.903082,-0.031279,0.302081,0.248023,-0.51513,0.581654
carbo,0.240576,-0.138324,-0.330632,-0.330632,0.374864,-0.35973,1.0,-0.345421,-0.359415,0.251655,-0.126421,0.129411,0.372061,0.055656
sugars,0.562287,-0.318928,0.281207,0.281207,0.077265,-0.132995,-0.345421,1.0,0.033592,0.126209,0.118738,0.449796,-0.042913,-0.75669
potass,-0.060124,0.551437,0.202186,0.202186,-0.048907,0.903082,-0.359415,0.033592,1.0,0.022207,0.370002,0.416407,-0.499454,0.375718
vitamins,0.26617,0.008261,-0.029062,-0.029062,0.352995,-0.031279,0.251655,0.126209,0.022207,1.0,0.297914,0.320571,0.126451,-0.240859
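
A full correlation matrix is often easier to read one column at a time. The sketch below pulls out a single column and sorts it. It uses a toy frame built from the five rows shown by `df.head(5)`, so the numbers will not match the full-dataset matrix.

```python
import pandas as pd

# First five rows from df.head(5); 'mfr' is included to show numeric_only at work
toy = pd.DataFrame({
    "mfr": ["N", "Q", "K", "K", "R"],
    "sugars": [6, 8, 5, 0, 8],
    "fiber": [10.0, 2.0, 9.0, 14.0, 1.0],
    "rating": [68.402973, 33.983679, 59.425505, 93.704912, 34.384843],
})

# Correlation of every numeric column with 'rating', most negative first
rating_corr = (
    toy.corr(numeric_only=True)["rating"]
    .drop("rating")          # a column always correlates 1.0 with itself
    .sort_values()
)
print(rating_corr)
```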


### Calculating r (the Pearson coefficient) by hand:

The Pearson correlation coefficient is given by 
$r = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}
{\sqrt{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}} \, \sqrt{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}}$.

where:

- $x_{i}$, $y_{i}$ = the individual observed values of the two variables  
- $\bar{x}$, $\bar{y}$ = the means of those values  
- $n$ = the number of observations

In [35]:
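import numpy as np

# Means of the two columns being compared
x_mean = df['carbo'].mean()
y_mean = df['protein'].mean()

# Deviation of each observation from its column mean
x_diff = df['carbo'] - x_mean
y_diff = df['protein'] - y_mean

# Pearson r: sum of cross-products over the product of the root sums of squares
numerator = np.sum(x_diff * y_diff)
denominator = np.sqrt(np.sum(x_diff**2)) * np.sqrt(np.sum(y_diff**2))
r = numerator / denominator
print(round(r, 8))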

-0.1383241
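
The hand calculation can be cross-checked against the built-in routines. The sketch below does this for two short series taken from the `carbo` and `protein` values in `df.head(5)`, so the result differs from the full-dataset value of -0.1383.

```python
import numpy as np
import pandas as pd

carbo = pd.Series([5.0, 8.0, 7.0, 8.0, 14.0])
protein = pd.Series([4, 3, 4, 4, 2])

# Manual Pearson r, following the formula above
x_diff = carbo - carbo.mean()
y_diff = protein - protein.mean()
r_manual = (x_diff * y_diff).sum() / np.sqrt((x_diff**2).sum() * (y_diff**2).sum())

# Built-in equivalents
r_pandas = carbo.corr(protein)               # Pearson is the default method
r_numpy = np.corrcoef(carbo, protein)[0, 1]  # off-diagonal entry of the 2x2 matrix

print(r_manual, r_pandas, r_numpy)
```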


### Categorical Statistical Analysis:

In [28]:
# Get frequency of values in column 'mfr'
frequency = df["mfr"].value_counts()
frequency

mfr
K    24
G    22
P     9
Q     8
R     8
N     6
A     1
Name: count, dtype: int64

In [30]:
df['mfr'].mode()

0    K
Name: mfr, dtype: object
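
Raw counts are often easier to compare as proportions. A minimal sketch with `value_counts(normalize=True)` follows, using a short made-up list of manufacturer codes rather than the real column.

```python
import pandas as pd

mfr = pd.Series(["K", "G", "K", "N", "Q", "K", "G"])

# Share of rows per category instead of raw counts
shares = mfr.value_counts(normalize=True)
print(shares)

# mode() returns a Series because several values can tie for most frequent
most_common = mfr.mode().iloc[0]
print(most_common)
```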

### Use NumPy to analyze the data

Arrays and their shapes:

In [None]:
import numpy as np
array_cereal = 

In [None]:
array_cereal

In [None]:
print( 

In [None]:
# Check the number of dimensions


In [None]:
# Check the number of rows
l

In [None]:
# Check the number of rows and columns


In [None]:
# Take the first 5 values of the little array



In [None]:
# Reshape this data into 10 columns and two rows
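
The cells above are left blank in this notebook. As a rough sketch of the operations their comments describe, the example below uses a hypothetical 20-element array named `little_array`; how the notebook's own `array_cereal` is built is not shown, so nothing here should be read as the intended solution.

```python
import numpy as np

# Hypothetical stand-in for the notebook's array: 20 consecutive integers
little_array = np.arange(20)

print(little_array.ndim)    # number of dimensions
print(len(little_array))    # length along the first axis
print(little_array.shape)   # size along every axis

# First five values
first_five = little_array[:5]

# Two rows and ten columns; the total element count (20) must stay the same
reshaped = little_array.reshape(2, 10)
print(reshaped.shape)
```

For a 2-D array, `len()` gives the number of rows and `.shape` gives `(rows, columns)`.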


### Join & Split Arrays

In [None]:
array_cereal_2 = 
print(array_cereal_2)
print("Left array shape:",array_cereal_1.shape,"\nRight array shape:",array_cereal_2.shape)
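
A minimal sketch of joining two arrays side by side and splitting them apart again is shown below. The arrays `left` and `right` are hypothetical stand-ins for `array_cereal_1` and `array_cereal_2`, whose contents are not given in this notebook.

```python
import numpy as np

left = np.array([[1, 2], [3, 4]])   # shape (2, 2)
right = np.array([[5], [6]])        # shape (2, 1)

# Join along columns (axis=1); the row counts must match
joined = np.concatenate([left, right], axis=1)
print(joined.shape)

# Split back at column index 2
first, second = np.split(joined, [2], axis=1)
print("Left array shape:", first.shape, "\nRight array shape:", second.shape)
```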


### Filtering on Arrays

In [None]:
new_array >= 
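
A comparison on an array returns a boolean mask, and indexing with that mask keeps only the matching elements. The sketch below uses made-up calorie values and an arbitrary threshold of 100; both are assumptions for illustration.

```python
import numpy as np

calories = np.array([70, 120, 70, 50, 110])

# Element-wise comparison produces one True/False per element
mask = calories >= 100
print(mask)

# Boolean indexing keeps the elements where the mask is True
selected = calories[mask]
print(selected)
```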