# Data Analysis with Python

## Learning Objectives

- Types of Data
- Population and Sample and their characteristics
- Arrays and Dataframes - accessing data and performing operations on data
- Types of Variables - features and labels
- Descriptive Statistics and EDA
    - Univariate Analysis
    - Bivariate Analysis
    - Handling Null Values
    - Handling Outliers
    - Hypothesis Testing
- Introduction to Machine Learning and its types
- Linear Regression
- Logistic Regression

## Types of data 

#### Numerical (Quantitative) Data:

Numerical data consists of numbers and is measured on a continuous or discrete scale.
- Continuous numerical data can take any value within a range (e.g., height, temperature).
- Discrete numerical data can take only specific, distinct values (e.g., number of siblings, number of cars).

#### Categorical (Qualitative) Data:

Categorical data represents categories or labels and is not inherently numerical.
- Nominal categorical data has categories with no inherent order or ranking (e.g., gender, eye color).
- Ordinal categorical data has categories with a specific order or ranking (e.g., education level, socioeconomic status).
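
The four data types above can be represented directly in pandas. The sketch below uses a small, made-up survey table (the column names and values are assumptions for illustration only) and shows how an ordinal variable differs from a nominal one once an explicit category order is declared.

```python
import pandas as pd

# Hypothetical survey records, invented for illustration
survey = pd.DataFrame({
    "height_cm": [172.5, 160.2, 181.0, 168.4],                       # continuous numerical
    "siblings": [2, 0, 1, 3],                                        # discrete numerical
    "eye_color": ["brown", "blue", "green", "brown"],                # nominal categorical
    "education": ["Bachelor", "High School", "Master", "Bachelor"],  # ordinal categorical
})

# Nominal: categories carry no order
survey["eye_color"] = survey["eye_color"].astype("category")

# Ordinal: declare the order explicitly, from lowest to highest
education_levels = ["High School", "Bachelor", "Master"]
survey["education"] = pd.Categorical(
    survey["education"], categories=education_levels, ordered=True
)

print(survey.dtypes)

# Ordered categories support comparisons; unordered ones do not
at_least_bachelor = survey[survey["education"] >= "Bachelor"]
print(at_least_bachelor)
```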

## Population and Sample

#### Population:

The population is the entire group or set of individuals, items, or elements that you are interested in studying and drawing conclusions about. It represents the complete set of possible observations that share a common characteristic or attribute.
For example, if you are studying the heights of all adult males in a country, the population would consist of the heights of all adult males in that country.

#### Sample:

A sample is a subset or a smaller representative group selected from the population.
It is used to make inferences or draw conclusions about the population without having to collect data from every individual in the population.

The process of selecting a sample from the population is known as sampling.
For example, instead of measuring the heights of all adult males in a country (which may be impractical or too costly), you might select a random sample of adult males and measure their heights to estimate the average height of the entire population.

#### Key points to note about population and sample:

- The population represents the entire group under study, while the sample represents a subset of that group.
- In many cases, it is not feasible or practical to collect data from the entire population, so researchers use samples to make inferences about the population.
- The goal of sampling is to obtain a representative sample that accurately reflects the characteristics of the population.
- Statistical techniques are used to analyze sample data and make generalizations or predictions about the population.
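
As a rough illustration of the points above, the following sketch simulates a population of heights and draws a simple random sample from it. The population is synthetic (the mean of 175 cm, the spread of 7 cm and the sizes are assumptions), so the numbers only demonstrate the idea that a sample mean estimates the population mean.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population: simulated heights (cm) of 100,000 adult males
population = rng.normal(loc=175, scale=7, size=100_000)

# Simple random sample of 500 individuals, drawn without replacement
sample = rng.choice(population, size=500, replace=False)

population_mean = population.mean()
sample_mean = sample.mean()

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
```

Rerunning with a different seed or sample size shows how the estimate varies from sample to sample.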

#### Characteristics of Population

- Mean (Average): The mean of a population is the average value of a quantitative variable across all individuals in the population. It represents the central tendency of the population distribution.

- Median: The median of a population is the middle value of a quantitative variable when all observations are arranged in ascending order. It is another measure of central tendency that is less affected by extreme values (outliers) compared to the mean.

- Mode: The mode of a population is the most frequently occurring value or category of a variable. It represents the value with the highest frequency in the population distribution.

- Variance: The variance of a population measures the spread or dispersion of values around the mean. It quantifies the average squared deviation of individual observations from the mean.

- Standard Deviation: The standard deviation of a population is the square root of the variance. It expresses the typical distance between individual observations and the mean, in the same units as the variable itself.

- Range: The range of a population is the difference between the maximum and minimum values of a variable. It provides a simple measure of the spread of values in the population.

- Distribution: The distribution of a population describes how the values of a variable are spread or distributed across the population. Common types of distributions include normal (bell-shaped), skewed (asymmetric), and uniform (evenly distributed).
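
The measures listed above can be computed with NumPy. This is a minimal sketch on a small made-up array that is treated as the whole population, which is why the variance and standard deviation use NumPy's default `ddof=0`.

```python
import numpy as np

# Hypothetical values, treated here as the entire population
values = np.array([22, 25, 25, 27, 30, 33, 35, 35, 35, 53])

mean = values.mean()
median = np.median(values)

# Mode: the value with the highest frequency
unique_values, counts = np.unique(values, return_counts=True)
mode = unique_values[counts.argmax()]

variance = values.var()   # population variance (ddof=0)
std_dev = values.std()    # square root of the population variance
value_range = values.max() - values.min()

print(mean, median, mode, value_range)
print(round(variance, 2), round(std_dev, 2))
```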

#### Characteristics of a Sample

- Sample Size: The sample size is the number of observations or individuals included in the sample. It represents the amount of data available for analysis and inference.
- Sampling Method: The sampling method describes how the sample was selected from the population. Common sampling methods include simple random sampling, stratified sampling, cluster sampling, and systematic sampling.

- Descriptive Statistics: Descriptive statistics summarize the main features of the sample data. Common descriptive statistics include measures of central tendency (mean, median, mode), measures of dispersion (range, variance, standard deviation), and measures of shape (skewness, kurtosis).

- Sample Proportion: The sample proportion represents the fraction or percentage of observations with a specific attribute or characteristic in the sample. It provides insights into the relative frequency of different categories in the sample.

- Confidence Interval: The confidence interval is a range of values that is likely to contain the true population parameter with a certain level of confidence. It is used to estimate the precision or uncertainty of sample statistics, such as the sample mean or proportion.

- Sampling Bias: Sampling bias refers to the systematic distortion or deviation of the sample from the population due to the sampling method used. It can affect the representativeness and generalizability of the sample data to the population.
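
A few of the sample characteristics above can be sketched as follows. The 12 ages are used purely as an example sample, and the 95% interval relies on a normal approximation (critical value 1.96), which is an assumption; for a sample this small, a t-based critical value would give a somewhat wider interval.

```python
import numpy as np

# Example sample of 12 ages
sample = np.array([35, 26, 36, 44, 33, 33, 23, 22, 35, 53, 25, 41])

n = sample.size                         # sample size
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)         # ddof=1 gives the sample standard deviation
proportion_over_40 = (sample > 40).mean()   # sample proportion

# Approximate 95% confidence interval for the mean (normal approximation)
standard_error = sample_std / np.sqrt(n)
z = 1.96
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"n = {n}, mean = {sample_mean:.2f}, std = {sample_std:.2f}")
print(f"Proportion over 40: {proportion_over_40:.2f}")
print(f"Approx. 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```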

<hr><hr>

## Arrays and Dataframes

In [None]:
import numpy as np  # has all functions required for basic numeric calculations
import pandas as pd  # has Dataframe object and is used to store and manipulate data
import matplotlib.pyplot as plt  # used for data visualization (base library)
import seaborn as sns  # used for data viz (wrapper over matplotlib)
import scipy.stats as stats  # for statistical functions

# Optional Parameters -
%matplotlib inline
pd.options.display.float_format = '{:.2f}'.format
plt.rcParams['figure.figsize'] = (4, 3)
plt.rcParams['font.size'] = 10

import warnings
warnings.filterwarnings('ignore')

### Examples on arrays

In [None]:
names = np.array(['Claire', 'Darrin', 'Sean', 'Brosina', 'Andrew', 'Irene', 'Harold', 'Pete', 'Alejandro', 'Zuschuss', 'Ken', 'Sandra', 'Emily', 'Eric','Tracy', 'Matt', 'Gene', 'Steve', 'Linda', 'Ruben', 'Erin', 'Odella', 'Patrick', 'Lena', 'Darren', 'Janet', 'Ted', 'Kunst', 'Paul', 'Brendan'])
ages = np.array([35, 26, 36, 44, 33, 33, 23, 22, 35, 53, 25, 41, 24, 31, 34, 43, 22, 29, 35, 25, 47, 22, 26, 52, 25, 25, 27, 24, 27, 30])
salary = np.array([ 88962,  67659, 117501, 149957,  32212,  63391,  14438,  22445, 72287, 195588,  17240, 115116,  18027,  55891, 109132,  83327, 22125,  29324,  54003,  18390, 141401,  19593,  57093, 130556, 22093,  13058,  26180,  23259,  34248,  27416])
designation = np.array(['Manager', 'Team Lead', 'Manager', 'Senior Manager', 'Team Lead', 'Team Lead', 'Developer', 'Developer', 'Team Lead', 'Managing Director', 'Developer', 'Manager', 'Developer','Team Lead', 'Manager', 'Manager', 'Developer', 'Team Lead','Team Lead', 'Developer', 'Senior Manager', 'Developer', 'Team Lead', 'Senior Manager', 'Developer', 'Developer','Team Lead', 'Developer', 'Team Lead', 'Team Lead'])

###### Ex. How many employees are present in the array

In [None]:
len(names)

In [None]:
names.size

In [None]:
names.dtype

In [None]:
ages.dtype

In [None]:
ages.ndim  # dimension of array 

###### Extract the ages greater than 40

In [None]:
ages[ages > 40]

###### Extract names of the employees whose age is greater than 40

In [None]:
names[ages > 40]

###### Apply 7% hike on salary to all employees in the array

In [None]:
salary * 1.07

###### Apply 7% hike to all employees whose age is greater than or equal to 40 and 10% hike to employees whose age is less than 40

In [None]:
np.where(ages >= 40, salary*1.07, salary * 1.10)

###### Generate 10 random numbers between 10-50

In [None]:
arr1 = np.random.randint(10, 50, 10)  # 10 random integers from 10 to 49 (the upper bound is exclusive)
arr1

In [None]:
arr1 = np.random.randint(10, 50, (5, 2))
arr1

###### Find unique elements from arr1

In [None]:
np.unique(arr1)

###### Find the elements of arr1 also present in arr2

In [None]:
arr1 = np.random.randint(10, 50, 10)
arr1

In [None]:
arr2 = np.random.randint(10, 50, 10)
arr2

In [None]:
np.intersect1d(arr1, arr2)

###### Are all elements of arr1 present in arr2

In [None]:
np.all(np.isin(arr1, arr2))  # True only if every element of arr1 is present in arr2 (np.isin supersedes the deprecated np.in1d)

In [None]:
np.any(np.isin(arr1, arr2))  # True if at least one element of arr1 is present in arr2

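###### Ex. Create age groups based on 3 criteria

In [None]:
# Conditions are checked in order and the first match wins, so "ages < 50" only applies to ages 35-49
conditions = [ages < 35, ages < 50]
groups = ["Group 1", "Group 2"]
np.select(conditions, groups, "Group 3")  # ages of 50 and above fall into the default, "Group 3"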

### Examples on Dataframes

In [None]:
df = pd.DataFrame({"Name" : names, "Age":ages, "Salary" : salary, "Designation" : designation})
df.head()  # displays only top N rows, default = 5

In [None]:
df.shape

In [None]:
# No of rows
df.shape[0]

In [None]:
df.dtypes

###### Ex. Display all the Developers in the DF

In [None]:
df[df["Designation"] == "Developer"] # Filtering the dataframe

In [None]:
df[df["Designation"] == "Developer"].Salary  # Salary column of the filtered rows

###### Ex. Display details of employees whose salary is in the range of 50k-80k

In [None]:
df[df.Salary.between(50000, 80000)]

###### Ex. Display employees in DESC sorted order of salaries 

In [None]:
df.sort_values("Salary", ascending=False) # Original df is not modified

In [67]:
df.sort_values("Salary", ascending=False, inplace=True) # to modify original df set inplace = True
df.head()

|  | Name | Age | Salary | Designation |
|---|---|---|---|---|
| 9 | Zuschuss | 53 | 195588 | Managing Director |
| 3 | Brosina | 44 | 149957 | Senior Manager |
| 20 | Erin | 47 | 141401 | Senior Manager |
| 23 | Lena | 52 | 130556 | Senior Manager |
| 2 | Sean | 36 | 117501 | Manager |


###### Ex. Extract first 10 rows based on index positions

In [68]:
df.iloc[0:10]  # positions 0 to 9; the end position is excluded

|  | Name | Age | Salary | Designation |
|---|---|---|---|---|
| 9 | Zuschuss | 53 | 195588 | Managing Director |
| 3 | Brosina | 44 | 149957 | Senior Manager |
| 20 | Erin | 47 | 141401 | Senior Manager |
| 23 | Lena | 52 | 130556 | Senior Manager |
| 2 | Sean | 36 | 117501 | Manager |
| 11 | Sandra | 41 | 115116 | Manager |
| 14 | Tracy | 34 | 109132 | Manager |
| 0 | Claire | 35 | 88962 | Manager |
| 15 | Matt | 43 | 83327 | Manager |
| 8 | Alejandro | 35 | 72287 | Team Lead |
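
After sorting, the row labels in the leftmost column are no longer in positional order, so position-based and label-based selection return different rows. The sketch below illustrates the difference on a small stand-in DataFrame (`demo` and its four rows are assumptions for the example, not the full `df`).

```python
import pandas as pd

# Small stand-in for the employee DataFrame
demo = pd.DataFrame({
    "Name": ["Claire", "Darrin", "Sean", "Brosina"],
    "Salary": [88962, 67659, 117501, 149957],
})

demo_sorted = demo.sort_values("Salary", ascending=False)

# iloc selects by position: the first two rows as currently ordered
first_two_by_position = demo_sorted.iloc[0:2]
print(first_two_by_position)

# loc selects by label: the rows whose original labels are 0 and 1
rows_by_label = demo_sorted.loc[[0, 1]]
print(rows_by_label)

# reset_index(drop=True) renumbers the labels to match the new order
renumbered = demo_sorted.reset_index(drop=True)
print(renumbered)
```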


#### Adding columns and dropping columns

###### Ex. Salary hike based on age

In [None]:
# Option 1: insert the new column at position 2 (raises ValueError if the column already exists)
df.insert(2, "New Salary", np.where(df.Age >= 40, df.Salary * 1.07, df.Salary * 1.10))

In [70]:
# Option 2: assign by name; the column is appended at the end, or overwritten if it already exists
df["New Salary"] = np.where(df.Age >= 40, df.Salary * 1.07, df.Salary * 1.10)
df.head()

|  | Name | Age | Salary | Designation | New Salary |
|---|---|---|---|---|---|
| 9 | Zuschuss | 53 | 195588 | Managing Director | 209279.16 |
| 3 | Brosina | 44 | 149957 | Senior Manager | 160453.99 |
| 20 | Erin | 47 | 141401 | Senior Manager | 151299.07 |
| 23 | Lena | 52 | 130556 | Senior Manager | 139694.92 |
| 2 | Sean | 36 | 117501 | Manager | 129251.1 |


In [71]:
df.drop(columns=["Salary"])  # original df is not modified; pass inplace=True to modify it

|  | Name | Age | Designation | New Salary |
|---|---|---|---|---|
| 9 | Zuschuss | 53 | Managing Director | 209279.16 |
| 3 | Brosina | 44 | Senior Manager | 160453.99 |
| 20 | Erin | 47 | Senior Manager | 151299.07 |
| 23 | Lena | 52 | Senior Manager | 139694.92 |
| 2 | Sean | 36 | Manager | 129251.1 |
| 11 | Sandra | 41 | Manager | 123174.12 |
| 14 | Tracy | 34 | Manager | 120045.2 |
| 0 | Claire | 35 | Manager | 97858.2 |
| 15 | Matt | 43 | Manager | 89159.89 |
| 8 | Alejandro | 35 | Team Lead | 79515.7 |


#### Operations on DF

In [85]:
num_data = pd.DataFrame({"Name" : ["Jane", "Sam", "Bill"],
                         "English" :[10, 20, 50], "Maths":[40, 40, 60], "Science" : [30, 20, 80]})
num_data

|  | Name | English | Maths | Science |
|---|---|---|---|---|
| 0 | Jane | 10 | 40 | 30 |
| 1 | Sam | 20 | 40 | 20 |
| 2 | Bill | 50 | 60 | 80 |


###### Ex. Calculate Total Marks and Percentage

In [86]:
num_data["English"] + num_data["Maths"] + num_data["Science"]

0     80
1     80
2    190
dtype: int64

In [88]:
num_data.select_dtypes("number").sum()  # Column wise sum

English     80
Maths      140
Science    130
dtype: int64

In [87]:
num_data.select_dtypes("number").sum(axis = 1) # row wise sum

0     80
1     80
2    190
dtype: int64

In [89]:
num_data["Total"] = num_data.select_dtypes("number").sum(axis = 1) # row wise sum
num_data

|  | Name | English | Maths | Science | Total |
|---|---|---|---|---|---|
| 0 | Jane | 10 | 40 | 30 | 80 |
| 1 | Sam | 20 | 40 | 20 | 80 |
| 2 | Bill | 50 | 60 | 80 | 190 |


In [91]:
num_data["Percentage"] = num_data["Total"] / 3  # equals a percentage only if each of the 3 subjects is marked out of 100
num_data

|  | Name | English | Maths | Science | Total | Percentage |
|---|---|---|---|---|---|---|
| 0 | Jane | 10 | 40 | 30 | 80 | 26.67 |
| 1 | Sam | 20 | 40 | 20 | 80 | 26.67 |
| 2 | Bill | 50 | 60 | 80 | 190 | 63.33 |


In [80]:
num_data.select_dtypes("object")

|  | Name |
|---|---|
| 0 | Jane |
| 1 | Sam |
| 2 | Bill |


#### map(), replace() and apply()

In [None]:
new_values = {'Managing Director' : "MD", 'Senior Manager' : "SM", 'Manager' : "MN", 
              'Team Lead' : "TL", 'Developer' : "DL"}


In [None]:
# map using function object
new_values = {'Senior Manager' : "SM", 'Manager' : "MN", 'Team Lead' : "TL", 'Developer' : "DL"}
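
The cells above only define the lookup dictionaries. As a hedged sketch of how the three methods differ, the example below applies a dictionary that has no entry for 'Managing Director' to a small stand-in Series (`designations` and `short_codes` are names assumed for this example). `map()` returns NaN for values missing from the dictionary, while `replace()` leaves them unchanged.

```python
import pandas as pd

# Small stand-in for the Designation column
designations = pd.Series(["Managing Director", "Manager", "Developer", "Team Lead"])

# Lookup without an entry for 'Managing Director'
short_codes = {"Senior Manager": "SM", "Manager": "MN", "Team Lead": "TL", "Developer": "DL"}

# map() with a dict: values missing from the dict become NaN
mapped = designations.map(short_codes)

# replace() with a dict: values missing from the dict are left as they are
replaced = designations.replace(short_codes)

# map() with a function: here the function falls back to the original value
mapped_with_function = designations.map(lambda d: short_codes.get(d, d))

# apply() calls a function on each element of the Series
name_lengths = designations.apply(len)

print(mapped.tolist())
print(replaced.tolist())
print(name_lengths.tolist())
```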


#### np.where() and np.select()

##### Note - Be careful with the objects returned by np.where and np.select. Both return a NumPy array, not a pandas Series, so the result carries no index

###### Assign grades based on marks
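
One possible way to assign grades is sketched below. The marks, the grade boundaries (60 and 40) and the grade labels are assumptions chosen for illustration; the point is that `np.select` handles several conditions, `np.where` handles a single one, and both return arrays that can be assigned straight into a DataFrame column.

```python
import numpy as np
import pandas as pd

# Hypothetical marks, invented for illustration
marks = pd.DataFrame({"Name": ["Jane", "Sam", "Bill"],
                      "Percentage": [26.67, 45.00, 63.33]})

# Conditions are evaluated in order; the first one that matches wins
conditions = [marks["Percentage"] >= 60, marks["Percentage"] >= 40]
grades = ["A", "B"]
grade_array = np.select(conditions, grades, default="C")

print(type(grade_array))   # a NumPy array, not a pandas Series

# Assigning the array to a column aligns it by position
marks["Grade"] = grade_array

# np.where covers the two-way case
marks["Result"] = np.where(marks["Percentage"] >= 40, "Pass", "Fail")

print(marks)
```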

<hr><hr>
