## Data Analytics Basics Tutorial

### 1. Key Libraries for Data Analysis

#### 1.1 Pandas, a Python library, provides data structures and operations for manipulating numerical tables and time series. Key features include:

- **Data structures for efficient data manipulation.**

In [None]:
# Flag to determine the source of the file
use_local_file = False # Change to True if you want to use a local file

In [None]:
import pandas as pd

if use_local_file:
  # Load the data from a local CSV file
  df = pd.read_csv('sample_data.csv')
else:
  # Load the data from the GitHub raw URL
  url = 'https://raw.githubusercontent.com/peyrone/PyAirPollution/main/sample_data.csv'
  df = pd.read_csv(url)

In [None]:
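# Add a 'Salary' column (assumes the DataFrame has exactly three rows;
# the list length must match the number of rows)
df['Salary'] = [70000, 80000, 90000]

# Add a 'Date' column parsed from 'YYYY-MM-DD' strings (one date per row)
df['Date'] = pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03'])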

# Set the 'Date' column as the index of the DataFrame
df.set_index('Date', inplace=True)

# Display the DataFrame
print(df)

- **Easy handling of missing data.**

In [None]:
# Remove rows with missing values from the DataFrame
df.dropna(inplace=True)

# Display the DataFrame
print(df)

- **Time series functionality.**

In [None]:
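# Convert daily data to monthly averages by resampling
# ('M' is the month-end frequency; pandas 2.2+ prefers the alias 'ME').
# numeric_only=True skips text columns, which cannot be averaged.
monthly_resampled_data = df.resample('M').mean(numeric_only=True)

# Filter the DataFrame to rows dated after January 1, 2021
filtered_df = df[df.index > pd.to_datetime('2021-01-01')]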

print("\nMonthly Resampled Data:\n", monthly_resampled_data)
print("\nFiltered DataFrame:\n", filtered_df)

#### 1.2 NumPy is a Python package used for scientific computing, which provides multi-dimensional arrays for numerical operations. Key features include:

- **Creating and Manipulating a Multidimensional Array**

In [None]:
import numpy as np

# Create a 2D array (3x3 matrix)
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Reshape the array to 1x9
reshaped_array = array_2d.reshape(1, 9)

print("Original Array:\n", array_2d)
print("Reshaped Array:\n", reshaped_array)

- **Basic Mathematical Operations on Arrays**

In [None]:
# Create two arrays
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

# Perform element-wise operations
addition = array1 + array2
subtraction = array1 - array2
multiplication = array1 * array2

print("Addition:", addition)
print("Subtraction:", subtraction)
print("Multiplication:", multiplication)

- **Advanced Mathematical Operations**

In [None]:
# Create two 2D arrays (matrices)
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# Matrix multiplication (the @ operator is equivalent to np.dot for 2D arrays)
matrix_product = matrix1 @ matrix2

# Compute the determinant of a matrix
determinant = np.linalg.det(matrix1)

print("Matrix Product:\n", matrix_product)
print("Determinant of the first matrix:", determinant)

#### 1.3 Matplotlib is a Python library for visualizing data with static, animated, and interactive charts and plots. Key features include:

- **Static Visualization**

In [None]:
import matplotlib.pyplot as plt

# Generate a sine wave
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Plotting
plt.plot(x, y)
plt.title("Sine Wave")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.show()

- **Animated Visualization**

In [None]:
import matplotlib.animation as animation

# Generating sine wave data
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)

fig, ax = plt.subplots()
line, = ax.plot(x, y)

def animate(i):
  line.set_ydata(np.sin(x + i / 10.0))  # update the data
  return line,

# Create animation
ani = animation.FuncAnimation(fig, animate, interval=50, blit=True)

plt.show()

- **Interactive Visualization**

In [None]:
fig, ax = plt.subplots()
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
line, = ax.plot(x, y)

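def onclick(event):
    # Ignore clicks outside the axes, where event.xdata is None
    if event.xdata is None:
        return
    # Shift the sine wave by the x-coordinate of the click
    line.set_ydata(np.sin(x + event.xdata))
    fig.canvas.draw_idle()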

fig.canvas.mpl_connect('button_press_event', onclick)
plt.show()

### 2. Data Analysis Process

This example analyzes the Pima Indians Diabetes Database, which includes health data from Pima Indian women near Phoenix, Arizona, USA, a group with a high incidence of type 2 diabetes. The dataset includes:

- **preg:** Number of pregnancies.
- **plas:** Plasma glucose concentration (a key indicator for diabetes).
- **pres:** Blood pressure (mm Hg).
- **skin:** Skinfold thickness (mm) at the triceps, used to estimate body fat.
- **insu:** 2-hour serum insulin level (μU/mL).
- **mass:** Body Mass Index (BMI), a measure of body fat based on height and weight.
- **pedi:** Diabetes pedigree function, indicating genetic predisposition to diabetes.
- **age:** Age in years.
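- **class:** Diabetes test result. The OpenML copy loaded below stores it as the labels `tested_positive` and `tested_negative`; some other distributions of the dataset encode it as 1 and 0.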

#### 2.1 Data importation

In [None]:
from sklearn.datasets import fetch_openml

diabetes_data = fetch_openml(name='diabetes', version=1, as_frame=True)
df = diabetes_data.frame

# Print the column names
print(df.columns)

#### 2.2 Data cleaning and preprocessing

In [None]:
# Replace zeros with NaN in columns where zero is not a valid value
# (np is the NumPy import from section 1.2)
cols_with_invalid_zeros = ['plas', 'pres', 'skin', 'insu', 'mass']
df[cols_with_invalid_zeros] = df[cols_with_invalid_zeros].replace(0, np.nan)

# Fill missing values with the mean of each numeric column
df = df.fillna(df.mean(numeric_only=True))

#### 2.3 Data exploration

**Typical insights from this dataset include:**

- **Distribution of Diagnostic Measures:** Analysis often reveals the distribution of key measures like plasma glucose, BMI, and age, which can help in understanding the general health profile of the population.

- **Correlation between Variables:** By examining correlations, one can identify which factors are more strongly associated with diabetes. For instance, higher plasma glucose levels might show a stronger correlation with diabetes outcomes.

- **Age and Diabetes:** Age distribution might reveal specific age groups at higher risk of diabetes.

- **Missing Data Impact:** The approach to handling missing data (like replacing zeros with the mean) can significantly affect the analysis outcomes, highlighting the importance of robust data preprocessing.

- **Predictive Modeling:** The dataset is commonly used to build predictive models to identify individuals at higher risk of developing diabetes, based on their medical measurements.
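
The point about missing data can be made concrete with a small sketch. The insulin readings below are hypothetical values chosen for illustration, not rows from the dataset, and the variable names are assumptions. The sketch compares the column mean when zeros are treated as real measurements with the mean when they are marked as missing, then applies mean imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical insulin readings; 0 stands in for "not recorded"
insu = pd.Series([0, 94, 168, 0, 88], dtype="float64")

# Treating zeros as real measurements pulls the mean down
mean_with_zeros = insu.mean()

# Marking zeros as missing excludes them from the mean
insu_missing = insu.replace(0, np.nan)
mean_without_zeros = insu_missing.mean()

# Mean imputation fills the gaps with the mean of the observed values
insu_imputed = insu_missing.fillna(mean_without_zeros)

print("Mean with zeros:", mean_with_zeros)
print("Mean without zeros:", mean_without_zeros)
print("Imputed values:", insu_imputed.tolist())
```

Mean imputation leaves the column mean unchanged but shrinks its variance, so it is worth comparing results against other strategies (such as the median) before drawing conclusions.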

In [None]:
# Summary statistics and information
print(df.describe())
df.info()  # info() prints its report directly and returns None

In [None]:
# Count the instances of each class
class_counts = df['class'].value_counts()
print(class_counts)

**Understanding Correlation Values:**

- Correlation coefficients range from -1 to 1.
- A value close to 1 implies a strong positive correlation (as one variable increases, so does the other).
- A value close to -1 implies a strong negative correlation (as one variable increases, the other decreases).
- A value around 0 suggests little or no linear relationship; the variables may still be related in a nonlinear way.
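
As a rough illustration of these ranges, the sketch below computes the Pearson correlation coefficient with `np.corrcoef` for three synthetic relationships. The arrays are made up for the example: one rises linearly, one falls linearly, and one is a symmetric curve that depends entirely on `x` yet has a coefficient of approximately zero.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)

# Perfect positive linear relationship: coefficient close to 1
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# Perfect negative linear relationship: coefficient close to -1
r_neg = np.corrcoef(x, -3 * x + 5)[0, 1]

# Symmetric, nonlinear relationship: y is fully determined by x,
# but the linear (Pearson) coefficient is approximately 0
x_centered = x - x.mean()
r_zero = np.corrcoef(x_centered, x_centered ** 2)[0, 1]

print("Positive:", round(r_pos, 3))
print("Negative:", round(r_neg, 3))
print("Nonlinear:", round(r_zero, 3))
```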

In [None]:
# Correlation Analysis (numeric_only=True skips the categorical 'class' column)
correlation_matrix = df.corr(numeric_only=True)
print("\nCorrelation Matrix:\n", correlation_matrix)

**Interpreting Each Pair:**

- Older age links to more pregnancies.
- Higher blood glucose relates to higher insulin levels.
- Thicker skin folds align with a higher body mass index.
- Increased body mass is associated with higher blood pressure.
- Diabetes pedigree function doesn't strongly relate to blood pressure, body mass, or age.
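
One way to read off pairs like these systematically is to flatten the correlation matrix and sort the pairs by absolute strength. The sketch below does this on a small hypothetical DataFrame (the columns `a`, `b`, `c` and their values are invented for illustration); the same steps could be applied to `correlation_matrix` from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, for illustration only (not from the dataset)
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],   # rises with a
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],   # tends to fall as a rises
})

corr = demo.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order pairs by absolute strength, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top instead of pushing them to the bottom of the list.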

#### 2.4 Data visualization

In [None]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install seaborn

In [None]:
# Heatmap of Correlation Matrix
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

In [None]:
# Histogram of age
plt.figure(figsize=(8, 6))
df['age'].hist(bins=15)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Scatter plot of Plasma Glucose vs. BMI colored by Outcome
plt.figure(figsize=(8, 6))
groups = df.groupby('class')
for name, group in groups:
  plt.scatter(group['plas'], group['mass'], label=f'Class {name}')
plt.legend()
plt.title('Plasma Glucose vs. BMI by Outcome')
plt.xlabel('Plasma Glucose')
plt.ylabel('BMI')
plt.show()