# An introduction to Machine Learning
Machine learning is a branch of artificial intelligence that enables algorithms to uncover hidden patterns within datasets, allowing them to make predictions on new, similar data without explicit programming for each task. Traditional machine learning combines data with statistical tools to predict outputs, yielding actionable insights. This technology finds applications in diverse fields such as image and speech recognition, natural language processing, recommendation systems, fraud detection, portfolio optimization, and automating tasks.

![image.png](attachment:b4265a39-6c8e-4268-b3e5-c299c236198d.png)

# Machine Learning lifecycle:
The lifecycle of a machine learning project involves a series of steps that include: 

1. <b>Study the Problems:</b>
The first step is to study the problem. This step involves understanding the business problem and defining the objectives of the model. 

2. <b>Data Collection:</b>
When the problem is well-defined, we can collect the relevant data required for the model. The data could come from various sources such as databases, APIs, or web scraping. 

3. <b>Data Preparation:</b>
Once the problem-related data is collected, check it carefully and convert it into the format the model needs so that it can find the hidden patterns. This is typically done in the following steps: 
* Data Cleaning
* Data Transformation
* Exploratory Data Analysis
* Feature Engineering & Selection
* Splitting the dataset for training and testing

4. <b>Model Selection:</b>
The next step is to select a machine learning algorithm that suits our problem. This step requires knowledge of the strengths and weaknesses of different algorithms. Often we train several models, compare their results, and select the one that best meets our requirements.

5. <b>Model building and Training:</b>
After selecting the algorithm, we have to build the model. 
In traditional machine learning, building the model is easy: it mostly comes down to setting a few hyperparameters. 
In deep learning, we have to define the layer-wise architecture along with the input and output sizes, the number of nodes in each layer, the loss function, the gradient descent optimizer, etc.
After that, the model is trained on the preprocessed dataset.

6. <b>Model Evaluation:</b>
Once the model is trained, it can be evaluated on the test dataset to measure its performance. Common tools include the classification report, precision, recall, F1 score, and ROC curve for classification, and mean squared error and mean absolute error for regression. 

7. <b>Model Tuning:</b>
Based on the evaluation results, the model may need to be tuned or optimized to improve its performance. This involves tweaking the hyperparameters of the model. 

8. <b>Deployment:</b>
Once the model is trained and tuned, it can be deployed in a production environment to make predictions on new data. This step requires integrating the model into an existing software system or creating a new system for the model. 

9. <b>Monitoring and Maintenance:</b>
Finally, it is essential to monitor the model’s performance in the production environment and perform maintenance tasks as required. This involves monitoring for data drift, retraining the model as needed, and updating the model as new data becomes available.
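
The preparation, training, and evaluation steps above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not a recommended setup: the data is synthetic (generated with `make_classification`), and the choice of `StandardScaler` plus `LogisticRegression` is an assumption made only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for collected data: 1000 rows, 10 numeric features, binary target.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Data preparation: hold out 20% of the rows for testing, keeping class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Model building and training: feature scaling followed by logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Model evaluation: compare predictions on the unseen test rows with the true labels.
y_pred = model.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Wrapping the scaler and the classifier in one pipeline means the scaling parameters are learned from the training rows only, which avoids leaking information from the test set.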

# Python Essential Libraries
<b>NumPy:</b> This library is fundamental for scientific computing with Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays.
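
As a small illustration of what "operating on arrays" means in practice, the sketch below builds a 2x3 matrix and applies a column-wise mean, broadcasting, and a matrix product. The values are made up for the example.

```python
import numpy as np

# A 2x3 matrix: two rows (samples), three columns (features).
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# Mean of each column, computed without an explicit Python loop.
column_means = matrix.mean(axis=0)

# Broadcasting: the 1-D means are subtracted from every row.
centered = matrix - column_means

# Matrix product of the 2x3 matrix with its 3x2 transpose gives a 2x2 matrix.
product = matrix @ matrix.T

print(column_means)
print(centered)
print(product)
```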

<b>Pandas:</b> Essential for data manipulation and analysis, Pandas provides data structures and operations for manipulating numerical tables and time series. It is ideal for data cleaning, transformation, and analysis.
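
A minimal sketch of cleaning and transforming a table with Pandas follows. The small `employees` frame is hypothetical; its column names only mirror the kind of columns used later in this notebook, and filling with the median or with `"Unknown"` is one possible choice among several.

```python
import pandas as pd

# Hypothetical data with two missing values.
employees = pd.DataFrame({
    "Age": [41, 49, None, 33],
    "Department": ["Sales", "Research & Development", "Sales", None],
    "Attrition": [1, 0, 1, 0],
})

cleaned = employees.copy()

# Data cleaning: fill the missing age with the median of the known ages.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())

# Data cleaning: label the missing department explicitly.
cleaned["Department"] = cleaned["Department"].fillna("Unknown")

# Data transformation: one-hot encode the text column into indicator columns.
encoded = pd.get_dummies(cleaned, columns=["Department"])

print(encoded.columns.tolist())
```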

<b>Matplotlib:</b> It is great for creating static, interactive, and animated visualizations in Python. Matplotlib is highly customizable and can produce graphs and charts that are publication quality.

<b>Scikit-learn:</b> Perhaps the most well-known Python library for machine learning, Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface. It includes methods for classification, regression, clustering, and dimensionality reduction, as well as tools for model selection and evaluation.
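
The "consistent interface" can be seen in a short sketch: different estimators are trained with `fit` and queried with `predict` in the same way. The Iris dataset and the three estimators below are chosen only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised estimators: fit(X, y) learns from labelled data, predict(X) returns labels.
for estimator in (DecisionTreeClassifier(random_state=0), KNeighborsClassifier(n_neighbors=5)):
    estimator.fit(X, y)
    predictions = estimator.predict(X[:3])
    print(type(estimator).__name__, predictions)

# Unsupervised estimator: fit_predict(X) needs no labels and returns a cluster index per row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)
print(cluster_labels[:10])
```

Because the method names are shared, swapping one algorithm for another during model selection usually means changing a single line.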

<b>SciPy:</b> Built on NumPy, SciPy extends its capabilities with more sophisticated routines for optimization, regression, interpolation, and eigenvalue decomposition, making it useful for scientific and technical computing.
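
Two of these routines, optimization and interpolation, can be sketched as follows. The function `quadratic` and the sample points are invented for the example.

```python
import numpy as np
from scipy import interpolate, optimize


def quadratic(x):
    """A simple bowl-shaped function whose minimum is at x = 3."""
    return (x[0] - 3.0) ** 2 + 1.0


# Optimization: search numerically for the minimum, starting from x = 0.
result = optimize.minimize(quadratic, x0=np.array([0.0]))
print(result.x)

# Interpolation: estimate a value between known sample points.
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 10.0, 20.0])
linear = interpolate.interp1d(xs, ys)
midpoint = float(linear(0.5))
print(midpoint)
```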

<b>TensorFlow:</b> Developed by Google, TensorFlow is used mainly for deep learning. It lets developers build large neural networks with many layers and supports both training them and running inference.

# 1. Study the Problems

In [2]:
import pandas as pd

df = pd.read_excel('AttritionCaseStudy.xlsx')

In [3]:
df.head()

| | Attrition | Age | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 0 | 49 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 1 | 37 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 0 | 33 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 0 | 27 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [6]:
df['Department'].unique()

array(['Sales', 'Research & Development', 'Human Resources'], dtype=object)
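
Because `Attrition` is stored as 0/1, one way to start studying the problem is to compare the attrition rate across departments. The sketch below shows the idea on a tiny made-up frame; the rows are illustrative and are not taken from `AttritionCaseStudy.xlsx`.

```python
import pandas as pd

# Hypothetical rows using the same column names and department labels as the dataset.
sample = pd.DataFrame({
    "Department": [
        "Sales",
        "Sales",
        "Research & Development",
        "Research & Development",
        "Human Resources",
    ],
    "Attrition": [1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of rows equal to 1, i.e. the attrition rate.
attrition_rate = sample.groupby("Department")["Attrition"].mean()

# Number of employees per department, to judge how reliable each rate is.
counts = sample["Department"].value_counts()

print(attrition_rate)
print(counts)
```

The same two calls can be applied to `df` directly once the file is loaded.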

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Attrition                 1470 non-null   int64 
 1   Age                       1470 non-null   int64 
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [None]:
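# Set up JupyterLab
# Set up a Python virtual environment (venv)
# DataFrame
# Types of data
# Independent variables (X1 ... Xn, the features) vs. the dependent variable
# (Y, the target variable we want to predict)

# Features X1, X2, X3, ......, Xn -> target Y (1000 rows)

# 800 rows = Train: the model learns from X1, X2, X3, ......, Xn and the true Y
# 200 rows = Test:  the model sees X1, X2, X3, ......, Xn and predicts ~Y

# Error = Y - ~Y (true value minus predicted value on the test rows)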