# Data Analytics in Python 1
(JupyterLab and Jupyter Notebooks)  📓

---

### Content
* ##### Reading files ✨
    * Importing Libraries
    * Loading Data
 
* ##### Summarizing and describing Data ✨
    * Summarizing Categorical Data
    * Handling Missing Data
      
* ##### Correlation and Covariance ✨

---
## Reading files ✨

In this section we will learn how to:

* Read a CSV file
* Read a text file
* Read an Excel file
* Read data from a different working directory
* Read Python built-in datasets


In [23]:
# firstly, print all files in the folder

import os
os.listdir()

['Data Analysis 2 - Copy.ipynb',
 'Data Analysis 2.ipynb',
 'iris.csv',
 'owid-covid-data.xlsx',
 'sample1.txt']

In [24]:
# get our current working directory

os.getcwd() 

'c:\\Users\\kimiy\\.vscode\\extensions\\sourcery.sourcery-1.19.0\\pythonbc'

> #### 1. Reading a CSV file
> * stores tabular data in plain text
> * text file format that uses commas to separate values
> * newlines to separate records
> * typically each line represents one data record

In [25]:
import pandas as pd

iris_df = pd.read_csv('iris.csv')
print("Data Read from CSV File:\n", iris_df.head())

Data Read from CSV File:
    sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2      se
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
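
To make the format described above concrete, the sketch below parses a small CSV string held in memory instead of a file on disk. The sample rows and the names `csv_text` and `sample_df` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content: the first line is the header,
# and each later line is one record with comma-separated values.
csv_text = """name,quantity,price
pens,100,0.5
rulers,40,1.25
"""

# io.StringIO lets read_csv treat the string like an open text file.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (rows, columns)
print(sample_df.dtypes)   # column types inferred from the text
```

Each newline-separated line becomes one row, and the comma-separated header becomes the column names.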



> #### 2. Reading a Text File
> * When working with files in Python, prefer opening them with the `with` statement (a context manager). The file is then closed and its resources are released as soon as the block ends, whether the code runs successfully or raises an exception. The first cell below uses an explicit `open()` / `close()` pair for comparison; the second uses `with`.

In [26]:
f_text = open('sample1.txt', 'r')

print(f_text.read())
f_text.close()

Utilitatis causa amicitia est quaesita.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Collatio igitur ista te nihil iuvat. Honesta oratio, Socratica, Platonis etiam. Primum in nostrane potestate est, quid meminerimus? Duo Reges: constructio interrete. Quid, si etiam iucunda memoria est praeteritorum malorum? Si quidem, inquit, tollerem, sed relinquo. An nisi populari fama?

Quamquam id quidem licebit iis existimare, qui legerint. Summum a vobis bonum voluptas dicitur. At hoc in eo M. Refert tamen, quo modo. Quid sequatur, quid repugnet, vident. Iam id ipsum absurdum, maximum malum neglegi.


In [27]:
with open('sample1.txt') as f:
    print(f.read())

Utilitatis causa amicitia est quaesita.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Collatio igitur ista te nihil iuvat. Honesta oratio, Socratica, Platonis etiam. Primum in nostrane potestate est, quid meminerimus? Duo Reges: constructio interrete. Quid, si etiam iucunda memoria est praeteritorum malorum? Si quidem, inquit, tollerem, sed relinquo. An nisi populari fama?

Quamquam id quidem licebit iis existimare, qui legerint. Summum a vobis bonum voluptas dicitur. At hoc in eo M. Refert tamen, quo modo. Quid sequatur, quid repugnet, vident. Iam id ipsum absurdum, maximum malum neglegi.
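
The claim that `with` closes the file even when an exception occurs can be checked directly. The sketch below is a minimal illustration that writes its own throwaway file (the name `demo.txt` and the simulated error are assumptions for the demo) and inspects the `closed` attribute afterwards.

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on sample1.txt.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# Normal case: the file is closed as soon as the with block ends.
with open(demo_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
closed_after_read = f.closed

# Error case: the file is still closed even though the block raised.
try:
    with open(demo_path, encoding="utf-8") as g:
        raise ValueError("simulated failure while processing the file")
except ValueError:
    closed_after_error = g.closed

print(lines, closed_after_read, closed_after_error)
```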


> #### 3. Reading an Excel file

In [35]:
df_excel = pd.read_excel('Inventory.xlsx')
df_excel

Unnamed: 0,Inventory,Quantity
0,pens,100
1,rulers,100
2,books,50
3,bags,100


> #### 4. Reading data from a different working directory
> 
> So far we have read files from the current working directory. Here the notebook lives in the folder "notebook", but the file we want to read sits in the folder "data": `Notebook -> Data`
> 
> ##### Method 1: Include the file path (the simplest option)
> We specify the path to the "data" folder relative to our current working directory. If the "data" folder is located at the same level as the "notebook" folder, the relative path is `../data`.

In [36]:
iris_df2 = pd.read_csv("../data/iris.csv")
iris_df2

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,se
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
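
A relative path such as `../data/iris.csv` can also be built with `pathlib`, which avoids hand-typing separators. The sketch below is one possible approach: it creates a temporary "notebook" and "data" folder pair (the layout and the file name `tiny.csv` are assumptions for the demo) and resolves the path from one to the other.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a throwaway layout: <root>/notebook and <root>/data side by side.
root = Path(tempfile.mkdtemp())
notebook_dir = root / "notebook"
data_dir = root / "data"
notebook_dir.mkdir()
data_dir.mkdir()
(data_dir / "tiny.csv").write_text("x,y\n1,2\n3,4\n", encoding="utf-8")

# From the notebook folder, ".." steps up to <root>, then "data" steps down.
csv_path = (notebook_dir / ".." / "data" / "tiny.csv").resolve()
tiny_df = pd.read_csv(csv_path)

print(csv_path.exists(), tiny_df.shape)
```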


> ##### Method 2: Change the working directory (useful to know)
> 
> > This requires the `os` module, which provides functions for interacting with the operating system. For more information see [The Python Standard Library os documentation](https://docs.python.org/3/library/os.html)
> 
> The steps are:
> * Check the current working directory
> * Change to the data folder by specifying its path
> * Read the CSV file

In [37]:
import os

os.getcwd() # get our current working directory, as seen previously

'c:\\Users\\kimiy\\.vscode\\extensions\\sourcery.sourcery-1.19.0\\pythonbc'

In [38]:
os.chdir('../data/') # changing our working directory to the data folder
os.getcwd()

'c:\\Users\\kimiy\\.vscode\\extensions\\sourcery.sourcery-1.19.0\\data'

In [39]:
iris_df3 = pd.read_csv('iris.csv')
iris_df3

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,se
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [40]:
# remember to change back to your original working directory

os.chdir('../pythonbc/')
os.getcwd()

'c:\\Users\\kimiy\\.vscode\\extensions\\sourcery.sourcery-1.19.0\\pythonbc'
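
Forgetting to change back is an easy mistake, so one option is to wrap the switch in a small context manager that restores the original directory automatically. The helper name `working_directory` below is hypothetical, and a temporary folder stands in for the data folder. Python 3.11 and later also ship `contextlib.chdir`, which serves the same purpose.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch to `path`, then return to the previous directory."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)  # runs even if the block raises


start_dir = os.getcwd()
other_dir = tempfile.mkdtemp()  # stand-in for the data folder

with working_directory(other_dir):
    inside_dir = os.getcwd()    # we are in the other folder here

end_dir = os.getcwd()           # back where we started
print(end_dir == start_dir)
```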

> #### 5. Reading Python built-in datasets

In [10]:
%pip install scikit-learn
import pandas as pd
from sklearn import datasets

dir(datasets) # to see all datasets
# print(datasets.load_iris().DESCR) # to see the description of the dataset

iris_skl = datasets.load_iris() # load the dataset once and reuse it
df_skl = pd.DataFrame(iris_skl.data, columns=iris_skl.feature_names)
df_skl.head(5)
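
The cell above builds the DataFrame from the raw feature array. As an alternative sketch, assuming a scikit-learn version that supports the `as_frame` argument of `load_iris`, the loader can return pandas objects directly. The variable names and the added `species` column are illustrative choices.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays.
iris_bunch = datasets.load_iris(as_frame=True)

# .frame holds the four feature columns plus an integer "target" column.
iris_frame = iris_bunch.frame.copy()

# Map the integer codes (0, 1, 2) to the species names stored in target_names.
code_to_name = dict(enumerate(iris_bunch.target_names))
iris_frame["species"] = iris_frame["target"].map(code_to_name)

print(iris_frame.shape)
print(iris_frame["species"].value_counts())
```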


## Summarizing and describing Data ✨

Descriptive statistics include:
* Measures of central tendency (center position of a distribution for a data set)
  > mean, median and mode
* Measures of variability or dispersion (spread of distribution)
  > variance or standard deviation, coefficient of variation, minimum and maximum values, IQR (Interquartile Range), skewness and kurtosis
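
As a small illustration of these measures, the sketch below computes them for a made-up series that contains one extreme value. The numbers are hypothetical and chosen only to show how the mean and the median can differ.

```python
import pandas as pd

# Hypothetical measurements in cm; the 10.0 is a deliberate extreme value.
values = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0])

mean_value = values.mean()           # pulled upwards by the extreme value
median_value = values.median()       # middle value of the sorted data
mode_value = values.mode().iloc[0]   # most frequent value (mode() returns a Series)

# Coefficient of variation: standard deviation relative to the mean (unitless).
coeff_variation = values.std() / values.mean()

print(mean_value, median_value, mode_value, round(coeff_variation, 3))
```

The median stays at the typical value while the mean is dragged towards the extreme one, which is why both are usually reported together.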

In [48]:
iris_df.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
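
The iris data has no missing values, so the check above returns zeros. To show what the same check looks like when values are missing, and two common ways of handling them, here is a minimal sketch on a made-up frame (`toy_df` and its contents are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing entries, one per column.
toy_df = pd.DataFrame(
    {
        "length": [5.1, np.nan, 4.7, 4.6],
        "species": ["setosa", "setosa", None, "virginica"],
    }
)

missing_per_column = toy_df.isnull().sum()   # count of missing values in each column

dropped = toy_df.dropna()                    # keep only fully populated rows
# Fill the numeric gap with the column median; the species gap is left untouched.
filled = toy_df.fillna({"length": toy_df["length"].median()})

print(missing_per_column)
print(len(dropped), filled["length"].tolist())
```

Whether to drop or fill depends on the analysis; filling with the median is only one of several reasonable choices.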

In [49]:
iris_df.describe() # excludes the character columns and gives summary statistics of numeric columns only

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [50]:
iris_df.describe(include=['object']) # gives summary statistics of the character columns.

Unnamed: 0,species
count,150
unique,4
top,virginica
freq,50
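
The summary above reports 4 unique `species` values, and the first row of the CSV preview shows `se` where `setosa` might be expected, so it may be worth listing each label with its frequency. The sketch below uses a small made-up series to show how `value_counts()` exposes such a label. Treating `se` as a truncated `setosa` is an assumption, not something the data confirms.

```python
import pandas as pd

# Hypothetical species column containing one truncated label, "se".
species = pd.Series(["se", "setosa", "setosa", "versicolor", "virginica", "virginica"])

counts = species.value_counts()                 # frequency of each distinct label
shares = species.value_counts(normalize=True)   # same, as proportions that sum to 1

# One possible clean-up: map the truncated label back to its full name.
cleaned = species.replace({"se": "setosa"})

print(counts)
print(cleaned.nunique())
```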


In [51]:
iris_df.describe(include='all')

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
count,150.0,150.0,150.0,150.0,150
unique,,,,,4
top,,,,,virginica
freq,,,,,50
mean,5.843333,3.054,3.758667,1.198667,
std,0.828066,0.433594,1.76442,0.763161,
min,4.3,2.0,1.0,0.1,
25%,5.1,2.8,1.6,0.3,
50%,5.8,3.0,4.35,1.3,
75%,6.4,3.3,5.1,1.8,
max,7.9,4.4,6.9,2.5,


In [52]:
def stat_descriptor(dataframe, field):
    """Print descriptive statistics for one numeric column measured in cm."""
    col = dataframe[field]
    q1, q2, q3 = col.quantile(0.25), col.quantile(0.5), col.quantile(0.75)
    print(f"{field} minimum value: {col.min()} cm")
    print(f"{field} maximum value: {col.max()} cm")
    print(f"{field} range: {col.max() - col.min()} cm")
    print(f"{field} variance: {col.var()} cm^2")  # variance is in squared units
    print(f"{field} standard deviation: {col.std()} cm")
    print(f"{field} Q1: {q1} cm")
    print(f"{field} Q2: {q2} cm")
    print(f"{field} Q3: {q3} cm")
    print(f"{field} IQR: {q3 - q1} cm")
    print(f"{field} skewness: {col.skew()}")  # unitless
    print(f"{field} kurtosis: {col.kurt()}")  # unitless

In [53]:
stat_descriptor(iris_df, 'petal_width')

petal_width minimum value: 0.1 cm
petal_width maximum value: 2.5 cm
petal_width range: 2.4 cm
petal_width variance: 0.582414317673378 cm^2
petal_width standard deviation: 0.7631607417008411 cm
petal_width Q1: 0.3 cm
petal_width Q2: 1.3 cm
petal_width Q3: 1.8 cm
petal_width IQR: 1.5 cm
petal_width skewness: -0.10499656214412734
petal_width kurtosis: -1.3397541711393433


| **_Statistic_** | **_Function_** | **_Measures_** | **_Notes_** |
|-----------------|----------------|----------------|-------------|
| Mean | .mean() | the average value of the data | sensitive to extreme values |
| Median | .median() | the middle value of the sorted data | less sensitive to extreme values than the mean |
| Mode | .mode() | the value that appears most frequently in a data set | |
| Minimum value | .min() | the smallest value | |
| Maximum value | .max() | the largest value | |
| Range | Max - Min | the distance between the smallest and largest values | |
| Variance | .var() | degree of spread in the dataset | measured around the mean, in squared units; larger variance = more spread data |
| Standard deviation | .std() | how dispersed the data is in relation to the mean | square root of the variance, in the same units as the data; larger st.dev = more variation around the mean |
| Q1 or 25th percentile | .quantile(0.25) | value below which 25% of the data points fall (sorted in increasing order) | |
| Q2 or 50th percentile | .quantile(0.5) | value below which 50% of the data points fall (sorted in increasing order) | same as the median |
| Q3 or 75th percentile | .quantile(0.75) | value below which 75% of the data points fall (sorted in increasing order) | |
| IQR | Q3 - Q1 | the spread of the middle 50% of the data | larger IQR = more variability |
| Skewness | .skew() | asymmetry of the data distribution | -1 to -0.5 = negatively skewed (longer left tail, most values lie to the right of the mean); -0.5 to 0.5 = almost symmetrical; 0.5 to 1 = positively skewed (longer right tail); below -1 or above 1 = highly skewed |
| Kurtosis | .kurt() | heaviness of the distribution tails | .kurt() reports excess kurtosis, so a normal distribution scores about 0; positive = heavier tails and a sharper peak than normal; negative = lighter tails and a flatter shape than normal; as a rule of thumb, above +2 = very peaked (thin bell) |
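
To see how the sign of the skewness relates to the shape of the data, the sketch below compares a symmetric series with one that has a long right tail. Both series are made up for illustration.

```python
import pandas as pd

symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])       # evenly spread around 3
right_tailed = pd.Series([1.0, 1.0, 1.0, 2.0, 10.0])   # one large value stretches the right tail

sym_skew = symmetric.skew()       # close to zero for symmetric data
right_skew = right_tailed.skew()  # positive when the right tail is longer

# IQR computed as Q3 - Q1, following the table above.
iqr = symmetric.quantile(0.75) - symmetric.quantile(0.25)

print(sym_skew, right_skew, iqr)
```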

### Covariance:
* measure of the linear association between two variables (measures how changes in one variable are associated with changes in another variable)
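* positive = variables tend to move in the same direction; negative = they tend to move in opposite directions
* not standardized: its magnitude depends on the units of the variables, so values cannot be compared across different pairs of variables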

### Correlation:
* standardized measure of the linear relationship between two variables
* quantifies the strength and direction of the relationship on a scale from -1 to +1
* +1 indicates a perfect positive linear relationship
* -1 indicates a perfect negative linear relationship
* 0 indicates no linear relationship
* unitless and not affected by the scale or units of the variables (easier to compare the strength of relationships)
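
The difference between the two measures shows up when the units change. The sketch below uses made-up paired measurements, converts one variable from cm to mm, and compares covariance and correlation before and after. It also computes the correlation by hand as the covariance divided by the product of the two standard deviations.

```python
import numpy as np

# Hypothetical paired measurements in cm.
x_cm = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_cm = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Re-express x in millimetres: same data, different unit.
x_mm = x_cm * 10

cov_cm = np.cov(x_cm, y_cm)[0, 1]
cov_mm = np.cov(x_mm, y_cm)[0, 1]        # scales with the unit change

# Correlation = covariance divided by the product of the standard deviations.
# ddof=1 matches the sample covariance that np.cov computes by default.
corr_manual = cov_cm / (np.std(x_cm, ddof=1) * np.std(y_cm, ddof=1))
corr_cm = np.corrcoef(x_cm, y_cm)[0, 1]
corr_mm = np.corrcoef(x_mm, y_cm)[0, 1]  # unchanged by the unit switch

print(cov_cm, cov_mm, corr_cm, corr_mm)
```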

In [55]:
import numpy as np

np.shape(iris_df)
lengths_array = np.array([iris_df['petal_length'],iris_df['sepal_length']])

np.cov(lengths_array)

array([[3.11317942, 1.27368233],
       [1.27368233, 0.68569351]])

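The covariance between petal length and sepal length is 1.27. The positive sign means that flowers with longer petals tend to have longer sepals. The magnitude alone does not tell us how strong the relationship is, because covariance depends on the units of the variables; the correlation coefficient computed next does.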

In [56]:
np.corrcoef(lengths_array)

array([[1.        , 0.87175416],
       [0.87175416, 1.        ]])
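
pandas offers the same calculations directly on a DataFrame, with row and column labels attached to the result. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the iris data) and checks that `DataFrame.cov()` agrees with `np.cov`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for two numeric columns of a larger DataFrame.
toy = pd.DataFrame(
    {
        "petal_length": [1.4, 1.3, 4.5, 4.7, 6.0],
        "sepal_length": [5.1, 4.7, 6.4, 7.0, 6.3],
    }
)

cov_matrix = toy.cov()     # labelled sample covariance matrix
corr_matrix = toy.corr()   # labelled Pearson correlation matrix

# Same numbers via NumPy, which expects one variable per row, hence the transpose.
np_cov = np.cov(toy.to_numpy().T)

print(corr_matrix.loc["petal_length", "sepal_length"])
```

The labelled output makes it easier to read off which pair of columns each entry refers to, which helps once more than two columns are involved.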