# Data Analytics in Python 1
(JupyterLab and Jupyter Notebooks)  📓

---

### Content
* ##### Reading files 
    * Importing Libraries
    * Loading Data
* ##### Summarizing and describing Data
* ##### Correlation and Covariance 

---
## Reading files ✨

In this section we will learn how to:

* Read a CSV file
* Read a text file
* Read an Excel file
* Read data from a different working directory
* Read Python built-in datasets


In [1]:
import os

# call to get our current working directory
os.getcwd() 

'c:\\Users\\kimiy\\.vscode\\extensions\\sourcery.sourcery-1.19.0\\pythonbc\\Draft1'

In [2]:
# printing all files in the folder
os.listdir()

['Data Analysis 1.ipynb',
 'Data Analysis 1.pptx',
 'Data Analysis 2.ipynb',
 'Data Analysis 2.pptx',
 'Data Analysis 3.pptx',
 'Data_Analysis_3.ipynb',
 'Inventory.xlsx',
 'iris.csv',
 'owid-covid-data.xlsx',
 'sample1.txt']

> #### 1. Reading a CSV file
> * stores tabular data in plain text
> * text file format that uses commas to separate values
> * newlines to separate records
> * typically each line represents one data record


In [3]:
# using pandas
import pandas as pd

iris_df = pd.read_csv('iris.csv')
print("Data Read from CSV File:\n", iris_df.head())

Data Read from CSV File:
    sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2      se
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [4]:
# getting the csv file column names, can also use .keys()
print(iris_df.columns)

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')


In [5]:
# call column
print(iris_df["sepal_length"])

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64


`loc` selects rows and columns by their labels (or by a boolean condition), whereas `iloc` selects rows and columns by their integer positions.

In [6]:
# select a single row by integer position (the third row)
print(iris_df.iloc[2])

# select every row that satisfies a condition on a column
print(iris_df.loc[iris_df['species'] == 'setosa'])

sepal_length       4.7
sepal_width        3.2
petal_length       1.3
petal_width        0.2
species         setosa
Name: 2, dtype: object
    sepal_length  sepal_width  petal_length  petal_width species
1            4.9          3.0           1.4          0.2  setosa
2            4.7          3.2           1.3          0.2  setosa
3            4.6          3.1           1.5          0.2  setosa
4            5.0          3.6           1.4          0.2  setosa
5            5.4          3.9           1.7          0.4  setosa
6            4.6          3.4           1.4          0.3  setosa
7            5.0          3.4           1.5          0.2  setosa
8            4.4          2.9           1.4          0.2  setosa
9            4.9          3.1           1.5          0.1  setosa
10           5.4          3.7           1.5          0.2  setosa
11           4.8          3.4           1.6          0.2  setosa
12           4.8          3.0           1.4          0.1  setosa
13           4.3 
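
To make the difference between `loc` and `iloc` concrete, here is a minimal sketch on a made-up DataFrame. The `flowers` table and its index labels are assumptions for illustration and are not part of `iris.csv`. Because the index labels are strings, label-based and position-based selection are easy to tell apart.

```python
import pandas as pd

# Hypothetical data with string labels as the index
flowers = pd.DataFrame(
    {"petal_length": [1.4, 4.7, 6.0], "species": ["setosa", "versicolor", "virginica"]},
    index=["a", "b", "c"],
)

by_label = flowers.loc["b", "petal_length"]   # row labelled "b", column "petal_length"
by_position = flowers.iloc[1, 0]              # second row, first column
print(by_label, by_position)

# Slices behave differently: loc includes the end label, iloc excludes the end position
rows_by_label = len(flowers.loc["a":"b"])
rows_by_position = len(flowers.iloc[0:1])
print(rows_by_label, rows_by_position)
```

With the default integer index used by `iris_df`, labels and positions happen to coincide, which is why the two can look interchangeable there.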

In [7]:
# Above we used pandas to read the file; there are several other ways to do this.

# Using the `csv` module
import csv

with open('iris.csv', mode='r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
['5.1', '3.5', '1.4', '0.2', 'se']
['4.9', '3', '1.4', '0.2', 'setosa']
['4.7', '3.2', '1.3', '0.2', 'setosa']
['4.6', '3.1', '1.5', '0.2', 'setosa']
['5', '3.6', '1.4', '0.2', 'setosa']
['5.4', '3.9', '1.7', '0.4', 'setosa']
['4.6', '3.4', '1.4', '0.3', 'setosa']
['5', '3.4', '1.5', '0.2', 'setosa']
['4.4', '2.9', '1.4', '0.2', 'setosa']
['4.9', '3.1', '1.5', '0.1', 'setosa']
['5.4', '3.7', '1.5', '0.2', 'setosa']
['4.8', '3.4', '1.6', '0.2', 'setosa']
['4.8', '3', '1.4', '0.1', 'setosa']
['4.3', '3', '1.1', '0.1', 'setosa']
['5.8', '4', '1.2', '0.2', 'setosa']
['5.7', '4.4', '1.5', '0.4', 'setosa']
['5.4', '3.9', '1.3', '0.4', 'setosa']
['5.1', '3.5', '1.4', '0.3', 'setosa']
['5.7', '3.8', '1.7', '0.3', 'setosa']
['5.1', '3.8', '1.5', '0.3', 'setosa']
['5.4', '3.4', '1.7', '0.2', 'setosa']
['5.1', '3.7', '1.5', '0.4', 'setosa']
['4.6', '3.6', '1', '0.2', 'setosa']
['5.1', '3.3', '1.7', '0.5', 'setosa']
['4.8', 
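
Note that `csv.reader` returns every field as a string, including the numbers. One possible way to get typed records is `csv.DictReader`, sketched below. The inline sample text stands in for `iris.csv` so that the example runs on its own, and the conversion to `float` assumes that every column except `species` is numeric.

```python
import csv
import io

# A few lines in the same layout as iris.csv (inline so the sketch is self-contained)
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "4.9,3,1.4,0.2,setosa\n"
    "4.7,3.2,1.3,0.2,setosa\n"
)

records = []
for row in csv.DictReader(sample):      # each row is a dict keyed by the header names
    species = row.pop("species")        # keep the text column as a string
    record = {name: float(value) for name, value in row.items()}
    record["species"] = species
    records.append(record)

print(records[0])
```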

In [8]:
# Using `numpy`
import numpy as np

# Note: the nan values in the last column appear because the species column holds text, which genfromtxt cannot parse as a number.
# nan can also appear when the delimiter (,) is wrong or when a field contains other unexpected characters.

data = np.genfromtxt('iris.csv', delimiter=',', skip_header=1)
print(data)

[[5.1 3.5 1.4 0.2 nan]
 [4.9 3.  1.4 0.2 nan]
 [4.7 3.2 1.3 0.2 nan]
 [4.6 3.1 1.5 0.2 nan]
 [5.  3.6 1.4 0.2 nan]
 [5.4 3.9 1.7 0.4 nan]
 [4.6 3.4 1.4 0.3 nan]
 [5.  3.4 1.5 0.2 nan]
 [4.4 2.9 1.4 0.2 nan]
 [4.9 3.1 1.5 0.1 nan]
 [5.4 3.7 1.5 0.2 nan]
 [4.8 3.4 1.6 0.2 nan]
 [4.8 3.  1.4 0.1 nan]
 [4.3 3.  1.1 0.1 nan]
 [5.8 4.  1.2 0.2 nan]
 [5.7 4.4 1.5 0.4 nan]
 [5.4 3.9 1.3 0.4 nan]
 [5.1 3.5 1.4 0.3 nan]
 [5.7 3.8 1.7 0.3 nan]
 [5.1 3.8 1.5 0.3 nan]
 [5.4 3.4 1.7 0.2 nan]
 [5.1 3.7 1.5 0.4 nan]
 [4.6 3.6 1.  0.2 nan]
 [5.1 3.3 1.7 0.5 nan]
 [4.8 3.4 1.9 0.2 nan]
 [5.  3.  1.6 0.2 nan]
 [5.  3.4 1.6 0.4 nan]
 [5.2 3.5 1.5 0.2 nan]
 [5.2 3.4 1.4 0.2 nan]
 [4.7 3.2 1.6 0.2 nan]
 [4.8 3.1 1.6 0.2 nan]
 [5.4 3.4 1.5 0.4 nan]
 [5.2 4.1 1.5 0.1 nan]
 [5.5 4.2 1.4 0.2 nan]
 [4.9 3.1 1.5 0.1 nan]
 [5.  3.2 1.2 0.2 nan]
 [5.5 3.5 1.3 0.2 nan]
 [4.9 3.1 1.5 0.1 nan]
 [4.4 3.  1.3 0.2 nan]
 [5.1 3.4 1.5 0.2 nan]
 [5.  3.5 1.3 0.3 nan]
 [4.5 2.3 1.3 0.3 nan]
 [4.4 3.2 1.3 0.2 nan]
 [5.  3.5 1

In [9]:
# To keep the text column instead of getting nan, tell genfromtxt what each column holds:
    # dtype: the expected data type of each column, so that numeric and string fields are both parsed correctly.
    # names: not used here; names=True would read the column names from the header row instead of skipping it.
    # missing_values and filling_values: which field contents count as missing, and what to put in their place.

dtype = [('col1', 'f8'), ('col2', 'f8'), ('col3', 'f8'), ('col4', 'f8'), ('col5', 'U50')]
# 'f8' represents a 64-bit (8-byte) float.
# 'U50' represents a Unicode string with a maximum length of 50 characters.

data = np.genfromtxt('iris.csv', delimiter=',', skip_header=1, dtype=dtype, missing_values="", filling_values=np.nan)
# missing_values="" treats empty fields as missing.
# filling_values=np.nan is the value used in place of a missing field.

print(data)

[(5.1, 3.5, 1.4, 0.2, 'se') (4.9, 3. , 1.4, 0.2, 'setosa')
 (4.7, 3.2, 1.3, 0.2, 'setosa') (4.6, 3.1, 1.5, 0.2, 'setosa')
 (5. , 3.6, 1.4, 0.2, 'setosa') (5.4, 3.9, 1.7, 0.4, 'setosa')
 (4.6, 3.4, 1.4, 0.3, 'setosa') (5. , 3.4, 1.5, 0.2, 'setosa')
 (4.4, 2.9, 1.4, 0.2, 'setosa') (4.9, 3.1, 1.5, 0.1, 'setosa')
 (5.4, 3.7, 1.5, 0.2, 'setosa') (4.8, 3.4, 1.6, 0.2, 'setosa')
 (4.8, 3. , 1.4, 0.1, 'setosa') (4.3, 3. , 1.1, 0.1, 'setosa')
 (5.8, 4. , 1.2, 0.2, 'setosa') (5.7, 4.4, 1.5, 0.4, 'setosa')
 (5.4, 3.9, 1.3, 0.4, 'setosa') (5.1, 3.5, 1.4, 0.3, 'setosa')
 (5.7, 3.8, 1.7, 0.3, 'setosa') (5.1, 3.8, 1.5, 0.3, 'setosa')
 (5.4, 3.4, 1.7, 0.2, 'setosa') (5.1, 3.7, 1.5, 0.4, 'setosa')
 (4.6, 3.6, 1. , 0.2, 'setosa') (5.1, 3.3, 1.7, 0.5, 'setosa')
 (4.8, 3.4, 1.9, 0.2, 'setosa') (5. , 3. , 1.6, 0.2, 'setosa')
 (5. , 3.4, 1.6, 0.4, 'setosa') (5.2, 3.5, 1.5, 0.2, 'setosa')
 (5.2, 3.4, 1.4, 0.2, 'setosa') (4.7, 3.2, 1.6, 0.2, 'setosa')
 (4.8, 3.1, 1.6, 0.2, 'setosa') (5.4, 3.4, 1.5, 0.4, 'setos
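
As an alternative to spelling out `dtype` by hand, `genfromtxt` can take the field names from the header row and infer a type for each column. The sketch below assumes a file laid out like `iris.csv`; the inline sample text is a stand-in so that the example is self-contained.

```python
import io
import numpy as np

sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "4.9,3.0,1.4,0.2,setosa\n"
    "4.7,3.2,1.3,0.2,setosa\n"
)

# names=True takes the field names from the header line;
# dtype=None lets numpy infer a type per column;
# encoding makes text columns come back as str rather than bytes
data = np.genfromtxt(sample, delimiter=",", names=True, dtype=None, encoding="utf-8")

print(data.dtype.names)
print(data["sepal_length"])   # one column, selected by name
print(data["species"])
```

The result is a structured array, so columns are selected by field name rather than by position.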


> #### 2. Reading a Text File

In [10]:
# built-in open function to read text files.
f_text = open('sample1.txt', 'r')

print(f_text.read())
f_text.close()

Utilitatis causa amicitia est quaesita.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Collatio igitur ista te nihil iuvat. Honesta oratio, Socratica, Platonis etiam. Primum in nostrane potestate est, quid meminerimus? Duo Reges: constructio interrete. Quid, si etiam iucunda memoria est praeteritorum malorum? Si quidem, inquit, tollerem, sed relinquo. An nisi populari fama?

Quamquam id quidem licebit iis existimare, qui legerint. Summum a vobis bonum voluptas dicitur. At hoc in eo M. Refert tamen, quo modo. Quid sequatur, quid repugnet, vident. Iam id ipsum absurdum, maximum malum neglegi.


In [11]:
# when working with files in Python, it is recommended to use the with context manager to open files. 
# This ensures that the file is properly closed and the resources are released, regardless of whether the program runs successfully or an exception occurs during the file operations.

with open('sample1.txt') as f:
    print(f.read())

Utilitatis causa amicitia est quaesita.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Collatio igitur ista te nihil iuvat. Honesta oratio, Socratica, Platonis etiam. Primum in nostrane potestate est, quid meminerimus? Duo Reges: constructio interrete. Quid, si etiam iucunda memoria est praeteritorum malorum? Si quidem, inquit, tollerem, sed relinquo. An nisi populari fama?

Quamquam id quidem licebit iis existimare, qui legerint. Summum a vobis bonum voluptas dicitur. At hoc in eo M. Refert tamen, quo modo. Quid sequatur, quid repugnet, vident. Iam id ipsum absurdum, maximum malum neglegi.
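
To see what the `with` statement guarantees, the following sketch raises an error in the middle of reading and then checks whether the file was closed. It writes a throwaway file called `demo.txt` (a made-up name) in a temporary folder so that it does not depend on `sample1.txt`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")

    try:
        with open(path, encoding="utf-8") as f:
            first = f.readline()
            raise ValueError("simulated failure while reading")
    except ValueError:
        pass

    # The with block closed the file even though an exception was raised inside it
    closed_after_error = f.closed

print(first.strip(), closed_after_error)
```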


In [12]:
# reading into a list
with open('sample1.txt', 'r') as file:
    lines = file.readlines()
    print(lines)

['Utilitatis causa amicitia est quaesita.\n', 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Collatio igitur ista te nihil iuvat. Honesta oratio, Socratica, Platonis etiam. Primum in nostrane potestate est, quid meminerimus? Duo Reges: constructio interrete. Quid, si etiam iucunda memoria est praeteritorum malorum? Si quidem, inquit, tollerem, sed relinquo. An nisi populari fama?\n', '\n', 'Quamquam id quidem licebit iis existimare, qui legerint. Summum a vobis bonum voluptas dicitur. At hoc in eo M. Refert tamen, quo modo. Quid sequatur, quid repugnet, vident. Iam id ipsum absurdum, maximum malum neglegi.']
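
As the output shows, `readlines()` keeps the trailing `\n` on each line. If the newline characters are not wanted, one option is `str.splitlines()`, sketched here with `pathlib`. The file `demo.txt` is a made-up stand-in created in a temporary folder.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.txt"
    path.write_text("first line\nsecond line\n\nlast line", encoding="utf-8")

    with open(path, encoding="utf-8") as f:
        raw_lines = f.readlines()   # newline characters are kept

    # read_text() returns the whole file as one string; splitlines() drops the newlines
    clean_lines = path.read_text(encoding="utf-8").splitlines()

print(raw_lines)
print(clean_lines)
```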


In [13]:
# using pandas
import pandas as pd

df = pd.read_csv('sample1.txt', delimiter='\t')  # Adjust delimiter as needed
print(df)

             Utilitatis causa amicitia est quaesita.
0  Lorem ipsum dolor sit amet, consectetur adipis...
1  Quamquam id quidem licebit iis existimare, qui...


> #### 3. Reading an Excel file

In [14]:
# using pandas

# Read a single sheet
df_excel = pd.read_excel('Inventory.xlsx')
print(df_excel)

  Inventory  Quantity
0      pens       100
1    rulers       100
2     books        50
3      bags       100


In [15]:
# Read all sheets
dfs = pd.read_excel('Inventory.xlsx', sheet_name=None)
for sheet, df in dfs.items():
    print(f"Sheet: {sheet}")
    print(df)

Sheet: Sheet1
  Inventory  Quantity
0      pens       100
1    rulers       100
2     books        50
3      bags       100
Sheet: Sheet2
  Inventory2  Quantity
0       pens        10
1     rulers        10
2      books         5
3       bags        10


> #### 4. Reading data from a different working directory
> 
> Next, we'll read a file from a folder other than the current working directory. Our notebook is in the folder "notebook", but the file we want to read is in the folder "data". `Notebook -> Data` 
> 
> ##### Method 1: Include the file path (the simplest option)
> We specify the path to the "data" folder relative to our current working directory. If the "data" folder is located at the same level as the "notebook" folder, the relative path is `../data`.

In [16]:
iris_df2 = pd.read_csv("../data/iris.csv")
iris_df2 

     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2         se
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
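
Relative paths can also be built with `pathlib` instead of typing `../data/` as a string. The sketch below recreates a hypothetical `notebook`/`data` layout inside a temporary folder, so the folder names and the tiny CSV file are assumptions made only for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: <project>/notebook and <project>/data side by side
    project = Path(tmp)
    notebook_dir = project / "notebook"
    data_dir = project / "data"
    notebook_dir.mkdir()
    data_dir.mkdir()
    (data_dir / "iris.csv").write_text("sepal_length,species\n5.1,setosa\n", encoding="utf-8")

    # Equivalent of "../data/iris.csv" as seen from the notebook folder
    csv_path = notebook_dir.parent / "data" / "iris.csv"
    found = csv_path.exists()
    relative = csv_path.relative_to(project).as_posix()

print(found, relative)
```

A `Path` object such as `csv_path` can be passed to `pd.read_csv` in place of a string.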


> ##### Method 2: change our working directory (useful to know)
> 
> > This requires the `os` module, which provides functions for interacting with the operating system. For more information see [The Python Standard Library os documentation](https://docs.python.org/3/library/os.html)
> 
> Where we:
> * Check Your Current Working Directory
> * Specify the Path to the Data Folder
> * Read the CSV File

In [17]:
import os

os.getcwd() # call to get our current working directory as seen previously

'c:\\Users\\kimiy\\.vscode\\extensions\\sourcery.sourcery-1.19.0\\pythonbc\\Draft1'

In [18]:
os.chdir('../data/') # changing our working directory to the data folder
os.getcwd()

'c:\\Users\\kimiy\\.vscode\\extensions\\sourcery.sourcery-1.19.0\\pythonbc\\data'

In [19]:
iris_df3 = pd.read_csv('iris.csv')
iris_df3

     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2         se
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica


In [20]:
# remember to change back to your original working directory

os.chdir('../draft1/')
os.getcwd()

'c:\\Users\\kimiy\\.vscode\\extensions\\sourcery.sourcery-1.19.0\\pythonbc\\draft1'
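
Because it is easy to forget to change back, a small context manager can restore the original working directory automatically. The helper name `working_directory` below is made up for this sketch, and a temporary folder stands in for the data folder.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily switch to `path`, then return to the previous directory."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)   # runs even if the body raises an exception

start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with working_directory(tmp):
        inside = os.getcwd()   # we are in the temporary folder here
end = os.getcwd()

print(start == end)
```

On Python 3.11 and later, `contextlib.chdir` offers the same behaviour in the standard library.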

## Summarizing and describing Data ✨

Descriptive statistics include:
* Measures of central tendency (center position of a distribution for a data set)
    * mean, median and mode
* Measures of variability or dispersion (spread of distribution)
    * variance or standard deviation, coefficient of variation, minimum and maximum values, IQR (Interquartile Range), skewness and kurtosis
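
As a quick illustration of these measures, the sketch below computes several of them on a small made-up sample (the numbers are not taken from the iris data) using pandas methods.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 5, 7, 9, 11])   # made-up sample

# Measures of central tendency
mean = values.mean()
median = values.median()
mode = values.mode().tolist()   # mode() returns a Series because ties are possible

# Measures of dispersion
variance = values.var()         # sample variance (divides by n - 1)
std = values.std()              # square root of the sample variance
cv = std / mean                 # coefficient of variation
iqr = values.quantile(0.75) - values.quantile(0.25)

print(mean, median, mode)
print(variance, round(std, 4), round(cv, 4), iqr)
```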

In [21]:
iris_df.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

In [22]:
iris_df.describe() # excludes the character columns and gives summary statistics of numeric columns only

       sepal_length  sepal_width  petal_length  petal_width
count         150.0        150.0         150.0        150.0
mean       5.843333        3.054      3.758667     1.198667
std        0.828066     0.433594       1.76442     0.763161
min             4.3          2.0           1.0          0.1
25%             5.1          2.8           1.6          0.3
50%             5.8          3.0          4.35          1.3
75%             6.4          3.3           5.1          1.8
max             7.9          4.4           6.9          2.5


In [23]:
iris_df.describe(include=['object']) # gives summary statistics of the character columns.

          species
count         150
unique          4
top     virginica
freq           50
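
The `unique` count of 4 deserves a second look: the iris data is normally described as having three species, and the first row printed earlier has the species value `se`. A frequency table is a quick way to spot a stray label like this. The sketch below uses a short made-up column rather than the real file.

```python
import pandas as pd

# Made-up labels, including one that looks like a truncated "setosa"
species = pd.Series(
    ["se", "setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    name="species",
)

counts = species.value_counts()   # frequency of each distinct label
print(counts)

# Labels that occur only once are worth checking by hand
rare = counts[counts == 1].index.tolist()
print(rare)
```

On the real data the equivalent call would be `iris_df['species'].value_counts()`.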


In [24]:
iris_df.describe(include='all')

        sepal_length  sepal_width  petal_length  petal_width    species
count          150.0        150.0         150.0        150.0        150
unique           NaN          NaN           NaN          NaN          4
top              NaN          NaN           NaN          NaN  virginica
freq             NaN          NaN           NaN          NaN         50
mean        5.843333        3.054      3.758667     1.198667        NaN
std         0.828066     0.433594       1.76442     0.763161        NaN
min              4.3          2.0           1.0          0.1        NaN
25%              5.1          2.8           1.6          0.3        NaN
50%              5.8          3.0          4.35          1.3        NaN
75%              6.4          3.3           5.1          1.8        NaN


In [25]:
def stat_descriptor(dataframe, field):
    """Print dispersion and shape statistics for one numeric column measured in cm."""
    col = dataframe[field]
    q1, q2, q3 = col.quantile(0.25), col.quantile(0.5), col.quantile(0.75)
    print(f"{field} minimum value: {col.min()} cm")
    print(f"{field} maximum value: {col.max()} cm")
    print(f"{field} range: {col.max() - col.min()} cm")
    print(f"{field} variance: {col.var()} cm^2")  # variance is in squared units
    print(f"{field} standard deviation: {col.std()} cm")
    print(f"{field} Q1: {q1} cm")
    print(f"{field} Q2: {q2} cm")
    print(f"{field} Q3: {q3} cm")
    print(f"{field} IQR: {q3 - q1} cm")
    print(f"{field} skewness: {col.skew()}")  # skewness has no unit
    print(f"{field} kurtosis: {col.kurt()}")  # kurtosis has no unit

In [26]:
stat_descriptor(iris_df, 'petal_width')

petal_width minimum value: 0.1 cm
petal_width maximum value: 2.5 cm
petal_width range: 2.4 cm
petal_width variance: 0.582414317673378 cm^2
petal_width standard deviation: 0.7631607417008411 cm
petal_width Q1: 0.3 cm
petal_width Q2: 1.3 cm
petal_width Q3: 1.8 cm
petal_width IQR: 1.5 cm
petal_width skewness: -0.10499656214412734
petal_width kurtosis: -1.3397541711393433


| **_Statistic_**       | **_Function_**  | **_Measures_**                                           | **_Notes_** |
|-----------------------|-----------------|----------------------------------------------------------|-------------|
| Mean                  | .mean()         | the average value of the data                            | |
| Median                | .median()       | the middle value of the sorted data                      | |
| Mode                  | .mode()         | the value that appears most frequently in the data       | |
| Minimum value         | .min()          | the smallest value                                       | |
| Maximum value         | .max()          | the largest value                                        | |
| Range                 | Max - Min       | the distance between the smallest and largest values     | |
| Variance              | .var()          | degree of spread around the mean                         | larger variance = data more spread out; expressed in squared units |
| Standard deviation    | .std()          | how dispersed the data is around the mean                | square root of the variance, so it is in the same units as the data; larger st.dev = more variation around the mean |
| Q1 or 25th percentile | .quantile(0.25) | value below which 25% of the sorted data points are found | |
| Q2 or 50th percentile | .quantile(0.5)  | value below which 50% of the sorted data points are found | equal to the median |
| Q3 or 75th percentile | .quantile(0.75) | value below which 75% of the sorted data points are found | |
| IQR                   | Q3 - Q1         | the spread of the middle 50% of the data                 | larger IQR = more variability |
| Skewness              | .skew()         | asymmetry of the data distribution                       | -1 to -0.5 = negatively skewed (longer left tail; most values lie to the right of the mean); -0.5 to 0.5 = almost symmetrical; 0.5 to 1 = positively skewed (longer right tail); lower than -1 or higher than 1 = highly skewed |
| Kurtosis              | .kurt()         | heaviness of the distribution tails                      | reported relative to the normal distribution, which scores 0; positive = more peaked than normal; negative = flatter than normal; greater than +2 = distribution is too peaked (thin bell) |
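
The sign conventions for skewness in the table can be checked on a few made-up samples. The three series below are invented for illustration; the sketch only relies on the pandas `.skew()` and `.kurt()` methods used above.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_tail = pd.Series([1, 1, 2, 2, 3, 20])       # one large value stretches the right tail
left_tail = pd.Series([-20, -3, -2, -2, -1, -1])  # mirror image of right_tail

skew_symmetric = symmetric.skew()   # approximately 0 for symmetric data
skew_right = right_tail.skew()      # positive: longer right tail
skew_left = left_tail.skew()        # negative: longer left tail

print(skew_symmetric, skew_right, skew_left)
print(right_tail.kurt())            # excess kurtosis (a normal distribution would score 0)
```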

### Covariance:
* measure of the linear association between two variables (how changes in one variable are associated with changes in another variable)
* positive = variables tend to move in the same direction
* negative = variables tend to move in opposite directions
* not standardized: its magnitude depends on the units of the variables, so on its own it does not show how strong the relationship is

### Correlation:
* standardized measure of the linear relationship between two variables
* quantifies the strength and direction of the relationship on a scale from -1 to +1
* +1 indicates a perfect positive linear relationship
* -1 indicates a perfect negative linear relationship
* 0 indicates no linear relationship
* unitless and not affected by the scale or units of the variables (so the strength of different relationships is easier to compare)
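
The point about units can be demonstrated with a short sketch. The measurements below are made up; converting them from centimetres to millimetres changes the covariance but leaves the correlation untouched.

```python
import numpy as np

# Made-up measurements in centimetres
x_cm = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_cm = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# The same measurements expressed in millimetres
x_mm, y_mm = x_cm * 10, y_cm * 10

cov_cm = np.cov(x_cm, y_cm)[0, 1]
cov_mm = np.cov(x_mm, y_mm)[0, 1]
corr_cm = np.corrcoef(x_cm, y_cm)[0, 1]
corr_mm = np.corrcoef(x_mm, y_mm)[0, 1]

print(cov_cm, cov_mm)     # covariance grows by a factor of 10 * 10 = 100
print(corr_cm, corr_mm)   # correlation is the same in both units
```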

In [27]:
import numpy as np

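np.shape(iris_df)  # (rows, columns) of the DataFrame

# np.cov treats each row of the array as one variable, so we stack the two columns as rows
lengths_array = np.array([iris_df['petal_length'],iris_df['sepal_length']])

# result is a 2x2 matrix: variances on the diagonal, covariance off the diagonal
np.cov(lengths_array)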

array([[3.11317942, 1.27368233],
       [1.27368233, 0.68569351]])

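The covariance between petal length and sepal length is 1.27. The positive sign tells us that flowers with longer petals tend to have longer sepals as well. The size of the number depends on the units of measurement, so it does not tell us how strong the relationship is; for that we use the correlation coefficient, computed next.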

In [28]:
np.corrcoef(lengths_array)

array([[1.        , 0.87175416],
       [0.87175416, 1.        ]])
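
A correlation of about 0.87 indicates a strong positive linear relationship between petal length and sepal length. The coefficient is the covariance divided by the product of the two standard deviations, which can be checked by hand with the values printed by `np.cov` above. The variable names in this sketch are made up for readability.

```python
import math

# Values copied from the np.cov output above
var_petal = 3.11317942         # variance of petal_length
var_sepal = 0.68569351         # variance of sepal_length
cov_petal_sepal = 1.27368233   # covariance of the two columns

# r = cov(x, y) / (std(x) * std(y))
r = cov_petal_sepal / (math.sqrt(var_petal) * math.sqrt(var_sepal))
print(round(r, 4))
```

Up to rounding of the copied inputs, this agrees with the off-diagonal entry returned by `np.corrcoef`.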