<a href="https://colab.research.google.com/github/passuony/0-hello-world/blob/master/data_science_job_salary_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:90%;text-align:center;border-radius:10px; border: 2px solid #FFA500; padding: 10px;">Data Science Salary Prediction </p>





In this project, I use machine learning models such as `Multiple Linear Regression, Lasso Regression and Random Forest` on my own dataset, `Data Science Jobs & Salaries`, to predict salaries. I tune the model with `GridSearchCV` and then `Test Ensembling Techniques`.

<a id='top'></a>
<p style="background-color:#6A5ACD;font-family:Tahoma, Geneva, sans-serif;color:#FFFFFF;font-size:150%;text-align:center;border-radius:10px;padding:10px;">  Learn About Data  </p>

### `About Dataset`

The data in this dataset was extracted from Glassdoor, a job posting website. It covers data science jobs and salaries, offering a clear view of job opportunities. It includes essential details such as job titles, estimated salaries, job descriptions, company ratings, and key company information such as location, size, and industry. Whether you're job hunting or researching, this dataset helps you understand the job market easily. Start exploring now to make smart career choices!

Perfect for adding to your Kaggle notebooks, our dataset is a treasure trove for analyzing all kinds of job-related info. Whether you're curious about salary trends or want to find the best-rated companies, this dataset has you covered. It's great for beginners and experts alike, offering lots of chances to learn and discover. You can use it to predict things or find hidden patterns—there's so much you can do! So, get ready to explore the world of jobs with our easy-to-use dataset on Kaggle.

>### [MY DATASET LINK](https://www.kaggle.com/datasets/fahadrehman07/data-science-jobs-and-salary-glassdoor)

<h4 style = "color:orange">Columns in Dataset:</h4>

1. `**Job Title:**` _Title of the job_
2. `**Salary Estimate:**` _Estimated salary for the job that the company provides_
3. `**Job Description:**` _The description of the job_
4. `**Rating:**` _Rating of the company_
5. `**Company Name:**` _Name of the company_
6. `**Location:**` _Location of the job_
7. `**Headquarters:**` _Headquarters of the company_
8. `**Size:**` _Number of employees in the company_
9. `**Founded:**` _The year the company was founded_
10. `**Type of ownership:**` _Ownership types like private, public, government, and non-profit organizations_
11. `**Industry:**` _Industry type like `Aerospace, Energy` where the company provides services_
12. `**Sector:**` _Type of services the company provides within the industry, e.g. industry (Energy), sector (Oil, Gas)_
13. `**Revenue:**` _Total revenue of the company_
14. `**Competitors:**` _Company competitors_


<!-- .......................................................................................................................... -->

<p style="background-color:#20B2AA;font-family:Arial, sans-serif;color:#FFFFFF;font-size:150%;text-align:center;border-radius:10px;padding:10px;">  Life Cycle Of Machine Learning Project  </p>


   <ul style="font-size: 18px; font-family: 'Segoe UI';">
        <li><strong>Understanding the Problem Statement</strong></li>
        <li><strong>Data Checks to Perform</strong></li>
        <li><strong>Exploratory Data Analysis</strong></li>
        <li><strong>Data Pre-Processing</strong></li>
        <li><strong>Model Training</strong></li>
        <li><strong>Choose Best Model</strong></li>
        <li><strong>Model Tuning</strong></li>
        <li><strong>Test Ensembling</strong></li>
        <li><strong>Putting Model into Production</strong></li>
    </ul>


<!-- >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> -->





<p style="background-color: #4cbb17; font-family: Arial, sans-serif; color: #ffffff; font-size: 24px; text-align: center; padding: 10px; border-radius: 10px;"> TABLE OF CONTENTS </p>
   
    
* [1. IMPORTING LIBRARIES](#1)
    
* [2. LOADING DATA](#2)

* [3. DATA CLEANING](#data_cleaning)

* [4. EXPLORATORY DATA ANALYSIS](#EDA)
    
* [5. Model Building](#ModelBuilding)   
    
* [6. TUNING BY GRIDSEARCHCV](#Tuning)
      
* [7. TEST ENSEMBLING](#ensembles)
    
* [8. PUTTING MODEL INTO PRODUCTION](#production)

<!-- ................................................................................................................................................. -->






<a id='1'></a>

# <p style="background-color:#682F2F;font-family:newtimeroman;color:#FFF9ED;font-size:80%;text-align:center;border-radius:10px; "><b>1|</b>  IMPORTING LIBRARIES </p>

<a id='2'></a>

# <p style="background-color: #20B2AA; font-family: Arial, sans-serif; color: #FFFFFF; font-size: 80%; text-align: center; border-radius: 10px; padding: 15px; "><b>2|</b>  Data Checks to Perform </p>




In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# For Word Analysis
from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('stopwords')

import warnings
warnings.filterwarnings('ignore')

#Importing Machine learning Models
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Lasso

from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_absolute_error

import pickle

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.



  <ul style="border: 2px solid #4CAF50; border-radius: 8px; margin-top: 10px; width:57%">
    <li>Check Missing values</li>
    <li>Check Duplicates</li>
    <li>Check data type</li>
    <li>Check the number of unique values of each column</li>
    <li>Check statistics of the dataset</li>
    <li>Check various categories present in the different categorical columns</li>
  </ul>
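
The checks listed above can be sketched with plain pandas calls. This is a minimal illustration on a tiny made-up frame (`demo` and its values are hypothetical, not the Glassdoor data); in the notebook the same calls would be applied to `df` once it is loaded. Note that in this dataset some missing values are stored as `-1` rather than `NaN`, so `isna()` alone will not catch them.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (hypothetical values, not the real dataset)
demo = pd.DataFrame({
    "Job Title": ["Data Scientist", "Data Analyst", "Data Scientist", "Data Engineer"],
    "Rating": [3.8, np.nan, 3.8, 4.1],
    "Sector": ["Tech", "Finance", "Tech", "Tech"],
})

missing_per_column = demo.isna().sum()         # missing values per column
duplicate_rows = int(demo.duplicated().sum())  # fully duplicated rows
column_dtypes = demo.dtypes                    # data type of each column
unique_counts = demo.nunique()                 # number of distinct values per column
summary_stats = demo.describe()                # statistics of the numeric columns
sector_counts = demo["Sector"].value_counts()  # categories present in one column

print(missing_per_column, duplicate_rows, unique_counts, sep="\n")
```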


<a id='data_cleaning'></a>
# <p style="background-color:#FF6347;font-family:Verdana, sans-serif;color:#FFFFFF;font-size:90%;text-align:center;border-radius:10px;padding:10px;"><b>3|</b>  Data Cleaning  </p>



In [None]:
df.columns

In [None]:
df.shape

In [None]:
pd.set_option('display.max_columns', None)
df.head(10)




#### By looking into the scraped data we will do the following tasks.

<div style="color: #1E90FF; display: inline-block; border-radius: 10px; background-color: #F0F8FF; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px groove #1E90FF; width:50%;">
    <p style="padding: 15px; color: #1E90FF; overflow: hidden; font-size: 24px; letter-spacing: 1px; margin: 0; width: 100%;">
        <b> Tasks List:</b>
    </p>
</div>


<div style="border: 2px solid #1E90FF; border-radius: 10px; margin-top: 10px; width:50%">
    <ol style="list-style-type: none; padding: 10px;">
        <li>1. Renaming Columns.</li>
        <li>2. Salary Parsing.</li>
        <li>3. Company Name text only.</li>
        <li>4. State of Field.</li>
        <li>5. Age of Company.</li>
        <li>6. Parsing of job description (python, etc.)</li>
    </ol>
</div>


<div style="color: #FFA500; display: inline-block; border-radius: 15px; background-color: #FFEFD5; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px solid #FFA500;width:60%"><p style="padding: 15px; color: #FFA500; overflow: hidden; font-size: 18px; letter-spacing: 1px; margin: 0; width: 750%;"><b>1. Renaming Columns:</b></p>
</div>



In [None]:
def title_simplifier(title):
    if 'data scientist' in title.lower():
        return 'data scientist'
    elif 'data engineer' in title.lower():
        return 'data engineer'
    elif 'analyst' in title.lower():
        return 'analyst'
    elif 'machine learning' in title.lower():
        return 'mle'
    elif 'manager' in title.lower():
        return 'manager'
    elif 'director' in title.lower():
        return 'director'
    else:
        return 'na'

def seniority(title):
    # Plain substring matching: 'sr' also matches any title that merely contains those letters
    title = title.lower()
    if 'sr' in title or 'senior' in title or 'lead' in title or 'principal' in title:
        return 'senior'
    elif 'jr' in title:  # also covers 'jr.'
        return 'jr'
    else:
        return 'na'

# These helpers and the steps below are applied later in the notebook:
# job title and seniority

# Fix state Los Angeles

# Job description length

# Competitor count

# Hourly wage to annual



<div style="color: #4CAF50; display: inline-block; border-radius: 8px; background-color: #E0FFFF; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px outset pink; width:60%">
    <p style="padding: 15px; color: #4CAF50; overflow: hidden; font-size: 18px; letter-spacing: 1px; margin: 0; width: 750px;">
        <b> 2. Salary Parsing:</b>
    </p>
</div>

Removing the -1 in the salary estimate column


In [None]:
df = df[df['Salary Estimate']!= '-1']
df.head(10)

## `Removing the glassdoor est text in the salary column:`

In [None]:
df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: x.split('(')[0])

df.head(10)


## `Removing the k and $ sign from the Salary Estimate:`

> Removing the k and $ sign from the salary column so that
we can predict or do analysis on numbers

In [None]:
df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: x.replace('K','').replace('$',''))
df.head(10)


>The K and $ sign are removed from the column.<br>
It's still a range of numbers, which we have to convert into a single number so it can be used for analysis and prediction.


## `Removing Per hour & Employer provided salary text in Salary Estimate Column:`

In [None]:
df['PerHour'] = df['Salary Estimate'].apply(lambda x: 1 if 'per hour' in x.lower() else 0)
# Match the same 'employer provided salary:' text that is stripped in the next cell
df['Employee'] = df['Salary Estimate'].apply(lambda x: 1 if 'employer provided salary:' in x.lower() else 0)

In [None]:
df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: x.lower().replace('per hour', ''))
df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: x.lower().replace('employer provided salary:', ''))

In [None]:
pd.set_option('display.max_rows', 10)
df


>Now the per hour text and employer provided salary text are removed from the Salary Estimate column.<br>
Only the numeric values remain, but they are still a range, which we have to convert into a single number.


## `Splitting the Salary estimated column:`

- The first will be the minimum salary.<br>
- The second will be the maximum salary.


In [None]:
df['Min_Salary'] = df['Salary Estimate'].apply(lambda x: int(x.split('-')[0]))
df['Max_Salary'] = df['Salary Estimate'].apply(lambda x: int(x.split('-')[1]))

- <b> Now two columns are created: Min_Salary and Max_Salary.</b><br>
To get a single number in the Salary Estimate column, we take the average of these two columns.

In [None]:
df['Salary Estimate']= (df['Min_Salary'] + df['Max_Salary'])/2

df.head()


- Now the Salary Estimate column is completely clean and ready to use.<br>
NOTE: The datatype of the values is float. If for some reason you want integers instead, use floor division as in the following line of code.

```
df['Salary Estimate']= (df['Min_Salary'] + df['Max_Salary'])//2
```
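
The salary-parsing steps above (drop the Glassdoor suffix, flag hourly and employer-provided rows, strip `K` and `$`, split the range, and average it) can also be collected in one helper. This is only a sketch: `parse_salary` is a hypothetical name that does not appear in the notebook, and the example strings are illustrative stand-ins for the raw `Salary Estimate` text.

```python
def parse_salary(text):
    """Return (per_hour, employer_provided, min_k, max_k, avg_k) for one raw salary string."""
    lowered = text.lower()
    per_hour = int("per hour" in lowered)
    employer_provided = int("employer provided salary:" in lowered)

    # Drop everything from the first "(" on, e.g. "(Glassdoor est.)"
    cleaned = lowered.split("(")[0]
    # Remove the marker phrases first, then the "k" and "$" symbols
    for token in ("per hour", "employer provided salary:", "k", "$"):
        cleaned = cleaned.replace(token, "")

    # What remains is "min-max"; int() tolerates surrounding spaces
    low, high = (int(part) for part in cleaned.split("-"))
    return per_hour, employer_provided, low, high, (low + high) / 2


# Illustrative inputs (hypothetical strings in the style of the raw column)
print(parse_salary("$53K-$91K (Glassdoor est.)"))
print(parse_salary("Employer Provided Salary:$80K-$120K"))
print(parse_salary("$17-$24 Per Hour(Glassdoor est.)"))
```

With pandas, such a helper could be applied once with `df['Salary Estimate'].apply(parse_salary)` instead of several separate `apply` calls.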

In [None]:



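# drop() returns a new DataFrame and the result is not assigned here, so df keeps
# Min_Salary and Max_Salary; both columns are used again in later cells.
df.drop(['Min_Salary','Max_Salary'] , axis =1)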



<div style="color: #4CAF50; display: inline-block; border-radius: 8px; background-color: #E0FFFF; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px outset pink; width:50%">
    <p style="padding: 15px; color: #4CAF50; overflow: hidden; font-size: 18px; letter-spacing: 1px; margin: 0; width: 750px;">
        <b> 3.Cleaning Company Name Column:</b>
    </p>
</div>

In [None]:
df.head(10)

In [None]:
df['Company Name'] = df.apply(lambda x: x['Company Name'] if x['Rating']<0 else x['Company Name'][:-1], axis = 1)


In [None]:
df['Company Name']= df['Company Name'].apply(lambda x: x.split('\n')[0])


<b>Now the Company Name column is cleaned and ready to use for EDA</b>

<div style="color: #4CAF50; display: inline-block; border-radius: 8px; background-color: #E0FFFF; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px outset pink; width:50%">
    <p style="padding: 15px; color: #4CAF50; overflow: hidden; font-size: 18px; letter-spacing: 1px; margin: 0; width: 750px;">
        <b> 4. State of Field Column Cleaning:</b>
    </p>
</div>

We can create the state column from the location column

In [None]:
df['State'] = df.Location.apply(lambda x: x.split(',')[1])


In [None]:
df.head(2)

In [None]:
# df= df.drop(['States'], axis = 1)
# df.head()

### `Let's see if the Location & Headquarters are the same`


In [None]:
df['Same State'] = df.apply(lambda x: 1 if x.Location==x.Headquarters else 0, axis =1)
df.head()

<div style="color: #4CAF50; display: inline-block; border-radius: 8px; background-color: #E0FFFF; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px outset pink; width:50%">
    <p style="padding: 15px; color: #4CAF50; overflow: hidden; font-size: 18px; letter-spacing: 1px; margin: 0; width: 750px;">
        <b> 5. Age of Company:</b>
    </p>
</div>

In [None]:
df['Age'] = df['Founded'].apply(lambda x: x if x<1 else 2023-x)


In [None]:
df.head(10)

<div style="color: #4CAF50; display: inline-block; border-radius: 8px; background-color: #E0FFFF; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px outset pink; width:50%">
    <p style="padding: 15px; color: #4CAF50; overflow: hidden; font-size: 18px; letter-spacing: 1px; margin: 0; width: 750px;">
        <b> 6. Parsing the Job Description (python etc):</b>
    </p>
</div>

<div style="border: 2px solid #4CAF50; border-radius: 8px; margin-top: 10px; width:40%">
    <ul style="list-style-type: none; padding: 10px;">
        <li>python.</li>
        <li>R Studio.</li>
        <li>Spark.</li>
        <li>AWS.</li>
        <li>Excel.</li>
    </ul>
</div>


In [None]:
# for python
df['Python_yn'] = df['Job Description'].apply(lambda x: 1 if 'python' in x.lower() else 0)



# For R Studio
df['R Studio'] = df['Job Description'].apply(lambda x: 1 if 'r studio' in x.lower() or 'r-studio' in x.lower() or 'r_studio' in x.lower() else 0)


# For Spark
df['Spark'] = df['Job Description'].apply(lambda x: 1 if 'spark' in x.lower() else 0)


# For AWS
df['AWS_yn'] = df['Job Description'].apply(lambda x: 1 if 'aws' in x.lower() else 0)


# For Excel
df['Excel_yn'] = df['Job Description'].apply(lambda x: 1 if 'excel' in x.lower() else 0)
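
One caveat with the substring checks above: `'excel' in x.lower()` is also true for "excellent", and `'aws'` is also true for "laws". A possible refinement, shown here only as a sketch, is to match whole words with a regular expression. The helper name `has_word` and the sample descriptions are hypothetical and not part of the notebook.

```python
import re

import pandas as pd

# Made-up job descriptions for illustration
descriptions = pd.Series([
    "Excellent communication skills and Python experience.",
    "Strong Excel and AWS background required.",
    "Build Spark pipelines; draws on laws of statistics.",
])


def has_word(series, word):
    """Return 1 where `word` appears as a whole word (case-insensitive), else 0."""
    pattern = rf"\b{re.escape(word)}\b"
    return series.str.contains(pattern, case=False, regex=True).astype(int)


# Substring match, as in the cell above: "Excellent" counts as a hit
substring_excel = descriptions.str.lower().str.contains("excel", regex=False).astype(int)

# Whole-word match: only the real mentions are flagged
whole_word_excel = has_word(descriptions, "excel")
whole_word_aws = has_word(descriptions, "aws")

print(substring_excel.tolist(), whole_word_excel.tolist(), whole_word_aws.tolist())
```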

In [None]:
df.head(10)

In [None]:
df.columns

In [None]:
#df = df.drop(['Unnamed: 0'], axis=1)
# columns = df.columns
# Named columns_to_keep rather than `list`, which would shadow the Python built-in
columns_to_keep = ['Job Title', 'Salary Estimate', 'Job Description',
       'Rating', 'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors',
       'PerHour', 'Employee', 'Min_Salary', 'Max_Salary', 'State',
       'Same State', 'Age', 'Python_yn', 'R Studio', 'Spark', 'AWS_yn',
       'Excel_yn']
# CleanedData= pd.DataFrame()
df = df[columns_to_keep]

In [None]:
pd.set_option('display.max_columns', None)
df.head()

In [None]:


df['Job_simp'] = df['Job Title'].apply(title_simplifier)

In [None]:


df.Job_simp.value_counts()

In [None]:
df['seniority'] = df['Job Title'].apply(seniority)
df.seniority.value_counts()

In [None]:
# Fix state: 'Los Angeles' appears where a state code is expected, so map it to 'CA'

df['job_state'] = df.State.apply(lambda x: x.strip() if x.strip().lower() != 'los angeles' else 'CA')
df.job_state.value_counts()

In [None]:
df.columns

In [None]:
# job description length

df['desc_len'] = df['Job Description'].apply(lambda x: len(x))

In [None]:
df.desc_len

## `- Competitor Count:`



In [None]:
# competitors count
df['Num_comp'] = df['Competitors'].apply(lambda x: len(x.split(',')) if x != '-1' else 0)

In [None]:
df[['Competitors','Num_comp']]

## `- Hourly Wage to annual`

In [None]:
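# Hourly wage to annual (salaries are in thousands, so the hourly rate is multiplied by 2)
df['Min_Salary'] = df.apply(lambda x: x.Min_Salary*2 if x.PerHour==1 else x.Min_Salary, axis =1)
df['Max_Salary'] = df.apply(lambda x: x.Max_Salary*2 if x.PerHour==1 else x.Max_Salary, axis =1)
# Note: 'Salary Estimate' was averaged before this conversion and is not recomputed here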
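
The factor of 2 in the cell above is not explained in the notebook. One reading, stated here as an assumption, is that a full-time year has roughly 2,000 working hours (40 hours a week for 50 weeks) and that salaries are kept in thousands of dollars, so an hourly rate times 2,000 divided by 1,000 is simply the rate times 2. The constant and function name below are hypothetical.

```python
HOURS_PER_YEAR = 2000  # assumption: 40 hours/week * 50 weeks; not stated in the notebook


def hourly_to_annual_k(hourly_rate):
    """Convert an hourly rate in dollars to an annual salary in thousands of dollars."""
    return hourly_rate * HOURS_PER_YEAR / 1000


annual_k = hourly_to_annual_k(25)
print(annual_k)  # a $25/hour job corresponds to 50.0, i.e. about $50K per year
```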



In [None]:
df[df.PerHour ==1][['PerHour','Min_Salary','Max_Salary']]

In [None]:
df.head()

> ### Now Everything is ready for EDA.

<a id='EDA'></a>
# <p style="background-color:#FF6347;font-family:Verdana, sans-serif;color:#FFFFFF;font-size:90%;text-align:center;border-radius:10px;padding:10px;"><b>4|</b>  Exploratory Data Analysis  </p>



In [None]:
df.describe()

In [None]:
df.columns

In [None]:
sns.set(style="whitegrid")
sns.set_palette("viridis")

df.Rating.hist()

![image.png](1.png)

In [None]:
sns.set(style="whitegrid")
sns.set_palette("viridis")

df['Salary Estimate'].hist()

![image.png](2.png)

In [None]:
sns.set(style="whitegrid")
sns.set_palette("viridis")

df.Age.hist()

![image.png](3.png)

In [None]:
sns.set(style="whitegrid")
sns.set_palette("viridis")

df.desc_len.hist()

![image.png](44.png)

## `Checking Outliers by boxplot`

<b>Columns:</b>

1. Salary Estimate.
2. Age.
3. desc_len.
4. Rating.
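
A boxplot draws points beyond the whiskers as outliers, and by default matplotlib places the whiskers at 1.5 times the interquartile range (IQR) from the quartiles. The same rule can be checked numerically. The sketch below uses a made-up salary series; `iqr_outliers` is a hypothetical helper, not something defined in this notebook.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the entries of `values` that lie more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up salaries in thousands, with one extreme value
salaries_k = pd.Series([60, 70, 75, 80, 85, 90, 95, 100, 250])
outliers = iqr_outliers(salaries_k)
print(outliers.tolist())
```

In the notebook this could be called as `iqr_outliers(df['Salary Estimate'])` to list the rows behind the points shown in the plots.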

In [None]:
# Define your custom color palette
custom_palette = ["#1f77b4", "#ff7f0e"]

# Set seaborn style and custom palette
sns.set(style="whitegrid", palette=custom_palette)

# Create the boxplot with custom colors
ax = df.boxplot(column=['Salary Estimate', 'Age'], boxprops=dict(color='lightblue'),
                medianprops=dict(color='red'), whiskerprops=dict(color='green'),
                capprops=dict(color='orange'), flierprops=dict(markerfacecolor='purple'))

# Set custom colors for the axes
ax.set_xlabel('Features')
ax.set_ylabel('Values')
ax.set_title('Estimated Salary and Age Columns')

# Show the plot
plt.show()




![image.png](5.png)

In [None]:
# Define your custom color palette
custom_palette = ["#1f77b4", "#ff7f0e"]

# Set seaborn style and custom palette
sns.set(style="whitegrid", palette=custom_palette)

# Create the boxplot with custom colors
ax = df.boxplot(column=['desc_len', 'Rating'], boxprops=dict(color='lightblue'),
                medianprops=dict(color='red'), whiskerprops=dict(color='green'),
                capprops=dict(color='orange'), flierprops=dict(markerfacecolor='purple'))

# Set custom colors for the axes
ax.set_xlabel('Features')
ax.set_ylabel('Values')
ax.set_title('Desc_len & Rating Columns')

# Show the plot
plt.show()







![image.png](6.png)

In [None]:
# Define your custom color palette
custom_palette = ["#1f77b4"]

# Set seaborn style and custom palette
sns.set(style="whitegrid", palette=custom_palette)

# Create the boxplot with custom colors
ax = df.boxplot(column=['Rating'], boxprops=dict(color='lightblue'),
                medianprops=dict(color='red'), whiskerprops=dict(color='green'),
                capprops=dict(color='orange'), flierprops=dict(markerfacecolor='purple'))

# Label the axes and title the plot
ax.set_xlabel('Features')
ax.set_ylabel('Values')
ax.set_title('Rating Column')

# Show the plot
plt.show()

![image.png](7.png)

## `Correlation between Variables`


In [None]:
df[['Age','Salary Estimate', 'Rating', 'desc_len']].corr()

In [None]:
cmap = sns.diverging_palette(220,10, as_cmap=True)

sns.heatmap(df[['Age','Salary Estimate', 'Rating', 'desc_len']].corr(), vmax=.3, center=0, cmap=cmap,
           square=True, linewidths=0.5, cbar_kws = {'shrink': .5})

![image.png](8.png)

In [None]:
df.columns

In [None]:
df_cat = df[['Location', 'Headquarters','Size','Type of ownership', 'Industry', 'Sector', 'Revenue','Company Name'
             ,'job_state', 'Same State','Python_yn', 'R Studio', 'Spark', 'AWS_yn',
       'Excel_yn', 'Job_simp', 'seniority']]

In [None]:
# Calculate the number of rows needed for subplots
num_rows = len(df_cat.columns) // 2 + len(df_cat.columns) % 2

# Create subplots
fig, axes = plt.subplots(num_rows, 2, figsize=(10, 8*num_rows))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Iterate over each column in df_cat
for i, col in enumerate(df_cat.columns):
    # Get value counts for the current column
    cat_num = df_cat[col].value_counts()

    # Generate a list of colors for each bar
    colors = sns.color_palette("coolwarm", len(cat_num))

    # Plot the bar chart with custom colors
    sns.barplot(x=cat_num.index, y=cat_num, palette=colors, ax=axes[i])

    # Set title for the subplot
    axes[i].set_title("Graph for %s" % col)

    # Rotate x-axis labels
    axes[i].tick_params(axis='x', rotation=90)

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plots
plt.show()

![image.png](9.png)

## `Making the Large Plots Readable (Top 20 Categories)`

In [None]:
# Get the columns of interest
columns_of_interest = ['Location', 'Headquarters', 'Company Name']

# Calculate the number of rows needed for subplots
num_plots = len(columns_of_interest)
num_rows = (num_plots + 1) // 2

# Create subplots with appropriate number of rows and columns
fig, axes = plt.subplots(num_rows, 2, figsize=(15, 5*num_rows))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Iterate over selected columns in df_cat
for i, col in enumerate(columns_of_interest):
    # Get value counts for the current column
    cat_num = df_cat[col].value_counts()[:20]
    print("Graph for %s: Total = %d" % (col, len(cat_num)))

    # Create a bar plot only if there are values to plot
    if not cat_num.empty:
        # Create a bar plot
        chart = sns.barplot(x=cat_num.index, y=cat_num, palette='viridis', ax=axes[i])

        # Rotate x-axis labels for better readability
        chart.set_xticklabels(chart.get_xticklabels(), rotation=90)

        # Set title for the subplot
        axes[i].set_title("Graph for %s" % col)
    else:
        # Remove the subplot if there are no values to plot
        fig.delaxes(axes[i])

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plots
plt.show()


![image.png](10.png)

In [None]:
df.columns

In [None]:
pd.pivot_table(df, index='Job_simp', values= 'Salary Estimate')

In [None]:
pd.pivot_table(df, index=['Job_simp','seniority'], values= 'Salary Estimate')

In [None]:
pd.set_option('display.max_rows',None)

In [None]:
pd.pivot_table(df, index=['job_state','Job_simp'], values= 'Salary Estimate', aggfunc='count').sort_values('job_state', ascending=False)

In [None]:
df.columns

### `Let's look only at data scientist roles`


In [None]:
pd.pivot_table(df[df.Job_simp=='data scientist'],index='job_state' ,values= 'Salary Estimate').sort_values('Salary Estimate', ascending=False)



## `Salary By Rating`


In [None]:
df.columns

In [None]:
# Rating, Industry, Sector, Revenue, number of competitors, hourly, employer provided, Python, R, Spark, AWS, Excel, desc len, type of ownership


In [None]:
# .copy() so the edits in the next cell work on an independent frame, not a slice of df
df_pivots = df[['Rating', 'Industry', 'Sector', 'Revenue', 'Employee','Num_comp', 'PerHour', 'Python_yn', 'R Studio', 'Spark', 'AWS_yn', 'Excel_yn', 'desc_len', 'Type of ownership','Salary Estimate' ]].copy()

In [None]:
# Check data types and convert 'Salary Estimate' column to numeric if necessary
df_pivots['Salary Estimate'] = pd.to_numeric(df_pivots['Salary Estimate'], errors='coerce')

# Drop rows with NaN values in 'Salary Estimate' column
df_pivots.dropna(subset=['Salary Estimate'], inplace=True)

# Create pivot table for each column
for i in df_pivots.columns:
    if i != 'Salary Estimate':  # Exclude 'Salary Estimate' column from pivot table creation
        print(i)
        pivot_table = pd.pivot_table(df_pivots, index=i, values='Salary Estimate', aggfunc='mean')
        pivot_table_sorted = pivot_table.sort_values('Salary Estimate', ascending=False)
        print(pivot_table_sorted.head(2))  # Print only the top 2 rows


# **`Words Analysis in Description by Word Cloud Plot`**

In [None]:
nltk.download('punkt')


words= " ".join(df['Job Description'])

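def punctuation_stop(text):
    """Tokenize, lowercase, and drop stop words and non-alphabetic tokens."""
    filtered = []
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    for w in word_tokens:
        w = w.lower()  # lowercase first so capitalized stop words such as 'The' are removed too
        if w not in stop_words and w.isalpha():
            filtered.append(w)
    return filtered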

words_filtered = punctuation_stop(words)

text= " ".join([ele for ele in words_filtered])

wc = WordCloud(background_color='black', random_state=1, stopwords=STOPWORDS, max_words=2000,width=800, height=1500)
wc.generate(text)

plt.figure(figsize=(6,25))
plt.imshow(wc,interpolation="bilinear")
plt.axis('off')
plt.show()

<a id='ModelBuilding'></a>
# <p style="background-color:#4CAF50;font-family:Arial, sans-serif;color:#FFFFFF;font-size:90%;text-align:center;border-radius:5px;padding:10px;"><b>5|</b>  Model Building For Salary Prediction  </p>



<div style="color: #1E90FF; display: inline-block; border-radius: 10px; background-color: #F0F8FF; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px groove #1E90FF; width:70%;">
    <p style="padding: 15px; color: #1E90FF; overflow: hidden; font-size: 24px; letter-spacing: 1px; margin: 0; width: 100%;">
        <b> Steps to be Followed while building Model:</b>
    </p>
</div>


<div style="border: 2px solid #1E90FF; border-radius: 10px; margin-top: 10px; width:70%">
    <ol style="list-style-type: none; padding: 10px;">
        <li>1. Choose Relevant Columns.</li>
        <li>2. Get Dummy Data.</li>
        <li>3. Train Test Split.</li>
        <li>4. Multiple Linear Regression.</li>
        <li>5. Lasso Regression.</li>
        <li>6. Random Forest.</li>
        <li>7. Tune Models GridSearchCV.</li>
        <li>8. Test Ensembles.</li>
    </ol>
</div>

<div style="color: #FFA500; display: inline-block; border-radius: 15px; background-color: #FFEFD5; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px solid #FFA500; width:50%">
    <p style="padding: 15px; color: #FFA500; overflow: hidden; font-size: 20px; letter-spacing: 1px; margin: 0; width: auto;">
        <b>1. Choosing Relevant Columns:</b>
    </p>
</div>


In [None]:
df.columns

In [None]:
df_model = df[['Salary Estimate', 'Rating', 'Size', 'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Num_comp', 'PerHour','Job Title', 'job_state','Same State','Age','Python_yn', 'Spark', 'AWS_yn', 'Excel_yn', 'Job_simp', 'seniority', 'desc_len']]
df_model.head()

<div style="color: #FFA500; display: inline-block; border-radius: 15px; background-color: #FFEFD5; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px solid #FFA500; width:50%">
    <p style="padding: 15px; color: #FFA500; overflow: hidden; font-size: 20px; letter-spacing: 1px; margin: 0; width: 100%;">
        <b>2. Get Dummy Data</b>
    </p>
</div>

In [None]:

df_dum  = pd.get_dummies(df_model)
pd.set_option('display.max_rows',None)

df_dum.head()
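
To see what `pd.get_dummies` does, here is a small sketch on a made-up frame (the `demo` frame is hypothetical). Numeric columns pass through unchanged, while each text column is replaced by one indicator column per category, named `<column>_<category>`. Depending on the pandas version the indicator columns are booleans or 0/1 integers, which is why the sketch casts to `int` before printing.

```python
import pandas as pd

# Made-up rows: one numeric column and one categorical column
demo = pd.DataFrame({
    "Rating": [3.8, 4.1, 3.5],
    "seniority": ["senior", "na", "jr"],
})

encoded = pd.get_dummies(demo)

print(list(encoded.columns))
print(encoded["seniority_senior"].astype(int).tolist())
```

A column with many distinct values, such as `Job Title`, produces one new column per distinct value, so the encoded frame can become very wide.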

In [None]:
df1 = pd.DataFrame(df)

In [None]:
# Convert boolean dummy columns to 0/1 integers; all other columns are left unchanged
df_int = df_dum.apply(lambda x: x.astype(int) if x.dtype == 'bool' else x)

# Display the resulting DataFrame
pd.set_option('display.max_rows',None)
df_int.head()

<div style="color: #FFA500; display: inline-block; border-radius: 15px; background-color: #FFEFD5; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px solid #FFA500; width:50%">
    <p style="padding: 15px; color: #FFA500; overflow: hidden; font-size: 20px; letter-spacing: 1px; margin: 0; width: 100%;">
        <b>3. Train Test Split</b>
    </p>
</div>



In [None]:
X= df_dum.drop('Salary Estimate', axis=1)
y= df_dum['Salary Estimate'].values

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2, random_state=42)


<div style="color: #FFA500; display: inline-block; border-radius: 15px; background-color: #FFEFD5; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px solid #FFA500; width:50%">
    <p style="padding: 15px; color: #FFA500; overflow: hidden; font-size: 20px; letter-spacing: 1px; margin: 0; width: 100%;">
        <b>4. Multiple Linear Regression</b>
    </p>
</div>



In [None]:
# !pip install statsmodels

In [None]:
# import statsmodels.api as sm

# # Assuming X is your independent variable array or DataFrame
# # Assuming y is your dependent variable array or DataFrame

# # Adding a constant term to independent variables
# X_sm = sm.add_constant(X)

# # Fit regression model
# model = sm.OLS(y, X_sm)
# results = model.fit()

# # Inspect the results
# print(results.summary())


In [None]:
lm = LinearRegression()
lm.fit(X_train,y_train)

cross_val_score(lm,X_train,y_train, scoring='neg_mean_absolute_error')

In [None]:
cross_val_score(lm,X_train,y_train, scoring='neg_mean_absolute_error', cv=3)


In [None]:
# The per-fold scores above vary, so we take their mean to get a single readable number
np.mean(cross_val_score(lm,X_train,y_train, scoring='neg_mean_absolute_error', cv=2))
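
A note on reading these numbers: with `scoring='neg_mean_absolute_error'`, scikit-learn returns the negative of the MAE for each fold, so that "higher is better" holds for every scorer. A score of -21 therefore means an average error of about 21 (thousand dollars here). The sketch below shows this on synthetic data; `X_demo` and `y_demo` are made up and unrelated to the salary dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data: a linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([10.0, -5.0, 2.0]) + rng.normal(scale=1.0, size=120)

# One negative-MAE value per fold
neg_mae = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring="neg_mean_absolute_error", cv=3)

# Flip the sign to get the error in the units of y
mean_mae = -neg_mae.mean()
print(neg_mae, mean_mae)
```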

<div style="color: #FFA500; display: inline-block; border-radius: 15px; background-color: #FFEFD5; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px solid #FFA500; width:50%">
    <p style="padding: 15px; color: #FFA500; overflow: hidden; font-size: 20px; letter-spacing: 1px; margin: 0; width: 100%;">
        <b>5. Lasso Regression</b>
    </p>
</div>

In [None]:

lm_l = Lasso()
np.mean(cross_val_score(lm_l, X_train, y_train, scoring= 'neg_mean_absolute_error', cv=3))

<b> The Lasso regression model has a lower cross-validated error than LinearRegression on our data, so it fits this data better.</b>


In [None]:
alpha = []
error = []

for i in range(1,1000):
    # Record the same alpha that is passed to Lasso, so the plot and table match the fitted models
    alpha.append(i/100)
    lml = Lasso(alpha=(i/100))
    error.append(np.mean(cross_val_score(lml,X_train, y_train, scoring='neg_mean_absolute_error', cv=2)))

plt.plot(alpha,error)

![image.png](11.png)

## `Checking how much we improve our model`

In [None]:
err = tuple(zip(alpha, error))
df_err = pd.DataFrame(err, columns= ['Alpha', 'error'])

# checking how much we improve the model
df_err[df_err.error==max(df_err.error)]
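
The alpha search above follows a simple pattern: score one Lasso model per candidate alpha, store each alpha next to its score, and pick the row with the highest (least negative) score. Here is a compact sketch of that pattern on synthetic data; `X_demo`, `y_demo`, and the short alpha grid are made up to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data: only the first of five features carries signal
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=100)

# Candidate alphas: 0.01, 0.06, ..., 0.46
alphas = [i / 100 for i in range(1, 51, 5)]
scores = [
    np.mean(cross_val_score(Lasso(alpha=a), X_demo, y_demo,
                            scoring="neg_mean_absolute_error", cv=3))
    for a in alphas
]

# Each row pairs an alpha with the score of the model fitted with that alpha
sweep = pd.DataFrame({"alpha": alphas, "neg_mae": scores})
best_row = sweep.loc[sweep["neg_mae"].idxmax()]
print(best_row)
```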


Now we train the Lasso model again with alpha set to 0.3.

In [None]:
lm_l = Lasso(alpha=0.3)
lm_l.fit(X_train,y_train)
np.mean(cross_val_score(lm_l, X_train, y_train, scoring= 'neg_mean_absolute_error', cv=3))


### `In Lasso Regression the error we record is:`

In [None]:
np.mean(cross_val_score(lm_l, X_train, y_train, scoring= 'neg_mean_absolute_error', cv=3))

<br>

- <b>We can see that the score improves from -21.09619987218103 to -18.272763, which is pretty awesome.</b>

<div style="color: #1E90FF; display: inline-block; border-radius: 10px; background-color: #F0F8FF; font-family: 'Arial', sans-serif; overflow: hidden; border: 5px groove #1E90FF; width:50%;">
    <p style="padding: 15px; color: #1E90FF; overflow: hidden; font-size: 24px; letter-spacing: 1px; margin: 0; width: 100%;">
        <b> 6. Random Forest Model or Algorithm</b>
    </p>
</div>

<div style="border: 2px solid #1E90FF; border-radius: 10px; margin-top: 10px; width:50%">
    <ol style="list-style-type: none; padding: 10px;">
        <li></li>
        <li>- Calculating its error.</li>
        <li>- Trying to minimize the error.</li>
    </ol>
</div>

#### `Training the Model`

In [None]:

rf = RandomForestRegressor()

#### `Calculating the error of Random Forest Model`


In [None]:
np.mean(cross_val_score(rf, X_train, y_train, scoring= 'neg_mean_absolute_error', cv=3))


It's quite awesome how well the Random Forest model is doing on our data.

Even without tuning, it performs better than the previous algorithms that we applied to our data.

<a id='Tuning'></a>
# <p style="background-color:#4CAF50;font-family:Arial, sans-serif;color:#FFFFFF;font-size:90%;text-align:center;border-radius:5px;padding:10px;"><b>6|</b>  Tuning by GridSearchCV  </p>



<div style="border: 2px solid #1E90FF; border-radius: 10px; margin-top: 10px; width:50%">
<h3 style="color:Yellow;">Let's understand how it works:</h3>
    <ol style="list-style-type: none; padding: 10px;">
        <li>- We give it the parameter values that we want to try in the model.</li>
        <li>- It trains and cross-validates the algorithm for every combination of those parameters.</li>
        <li>- We select the best combination according to the results of GridSearchCV.</li>
    </ol>
</div>






In [None]:
parameters = {
    'n_estimators': range(10, 300, 10),
    'criterion': ('friedman_mse', 'absolute_error'),
    'max_features': (None, 'sqrt', 'log2')
}

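# No `scoring` argument is given, so GridSearchCV ranks candidates by the regressor's default score (R^2), not by MAE
gs = GridSearchCV(rf, parameters, error_score='raise', cv=3)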
gs.fit(X_train, y_train)

In [None]:
gs.best_score_


In [None]:
gs.best_estimator_


<h3 style="color:yellow;"> Summary of GridSearchCV:</h3>

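The earlier cross-validated score of `-15.12471850484541` is a negative MAE, while the grid search reports a best score of `0.6424335492673592`. These two numbers are not on the same scale: no `scoring` was passed to `GridSearchCV`, so `best_score_` is the regressor's default R² score rather than an error, and it cannot be read as a reduction in MAE.<br>


According to GridSearchCV, the best-scoring model uses the following parameters: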

<p style ="color:RED;font-size:30px"><b>Parameters:</b></p>
1. criterion should be `absolute_error`.
2. max_features should be `None`.
3. n_estimators should be `40`.
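
To get a grid-search score that is directly comparable with the negative-MAE values from the earlier cross-validation cells, the same `scoring` string can be passed to `GridSearchCV`. The sketch below shows this on synthetic data with a deliberately small grid; `X_demo`, `y_demo`, and `param_grid` are made up, and results on the salary data would of course differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(90, 4))
y_demo = 5.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=90)

# Small illustrative grid, much smaller than the one used in the notebook
param_grid = {"n_estimators": [10, 30], "max_features": [None, "sqrt"]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # same metric as the cross_val_score cells
    cv=3,
)
search.fit(X_demo, y_demo)

# best_score_ is now a negative MAE; flip the sign to read it as an error
best_mae = -search.best_score_
print(search.best_params_, best_mae)
```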

<a id='ensembles'></a>
# <p style="background-color:#4CAF50;font-family:Arial, sans-serif;color:#FFFFFF;font-size:90%;text-align:center;border-radius:5px;padding:10px;"><b>7|</b>  Model Training or Test Ensembles:  </p>


Now we can train and test our models, because by now we know everything needed for them to perform well.

In [None]:
tpred_lm = lm.predict(X_test)
tpred_lml = lm_l.predict(X_test)
tpred_rf = gs.best_estimator_.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test,tpred_lm)


In [None]:
mean_absolute_error(y_test,tpred_lml)


In [None]:
mean_absolute_error(y_test,tpred_rf)

### `Averaging the two best models`


In [None]:
mean_absolute_error(y_test,(tpred_lml+tpred_rf)/2)


This value is in between the MAE of tpred_lml and that of tpred_rf, so averaging the two predictions does not beat the better single model here.<br>

MAE is an error, so lower is better: the high value for LinearRegression means it fits our data worse, not better.
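
The averaging step can be sketched end to end on synthetic data. One fact worth knowing: the MAE of an averaged prediction can never be larger than the mean of the two individual MAEs, but it may land between them or below both, so it has to be measured rather than assumed. Everything in this sketch (data, model settings, variable names) is made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data with a linear part and a non-linear part
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 4))
y_demo = 4.0 * X_demo[:, 0] + np.sin(3.0 * X_demo[:, 1]) + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

lasso = Lasso(alpha=0.1).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

pred_lasso = lasso.predict(X_te)
pred_forest = forest.predict(X_te)
pred_blend = (pred_lasso + pred_forest) / 2  # simple unweighted average

mae_lasso = mean_absolute_error(y_te, pred_lasso)
mae_forest = mean_absolute_error(y_te, pred_forest)
mae_blend = mean_absolute_error(y_te, pred_blend)
print(mae_lasso, mae_forest, mae_blend)
```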

<a id='production'></a>
# <p style="background-color:#4CAF50;font-family:Arial, sans-serif;color:#FFFFFF;font-size:90%;text-align:center;border-radius:5px;padding:10px;"><b>8|</b>  Putting the Model Into Production:  </p>

Using pickle to store the necessary values or variables in a pickle file so that they can be used in the Flask app.

In [None]:
import pickle
pickl = {'model': gs.best_estimator_}
# Use a context manager so the file is closed once the model is written
with open('model_file' + '.p', "wb") as f:
    pickle.dump(pickl, f)




In [None]:
file_name = "model_file.p"

with open (file_name, "rb") as pickled:
    data = pickle.load(pickled)
    model= data["model"]

model.predict(X_test.iloc[1,:].values.reshape(1,-1))



The value above is the prediction.
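
The save-and-load round trip can be checked in isolation. This sketch uses a small stand-in model and an in-memory buffer instead of `model_file.p`, purely so it runs without touching the disk; the data and names are made up. The point is that the restored model returns the same predictions as the original.

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model fitted on y = 2x
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X_demo, y_demo)

# Serialize into memory (stand-in for the pickle file), then read it back
buffer = io.BytesIO()
pickle.dump({"model": model}, buffer)
buffer.seek(0)
restored = pickle.load(buffer)["model"]

same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions, restored.predict(np.array([[5.0]])))
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.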
#####  `Everything is good. Now it's time to build the Flask app for it, but for now everything in this notebook is finished.`