# US Visa Prediction Project
**Life cycle of a Machine Learning Project**

* Understanding the Problem Statement
* Data Collection
* Exploratory data analysis
* Data Cleaning
* Data Pre-Processing
* Model Training 
* Choose best model

### About
The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis.
The act also protects US workers against adverse impacts in the workplace and sets requirements that employers must meet when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

## 1) Problem statement.

* OFLC processes job certification applications from employers seeking to bring foreign workers into the United States and grants certifications.
* Because the number of applicants last year was very large, OFLC needs machine learning models to shortlist visa applicants based on historical data.

**In this project we are going to use the data given to build a Classification model:**

* The model predicts whether a visa will be approved or not, based on the given dataset.
* It can be used to recommend whether an applicant's visa should be certified or denied, based on the criteria that influence the decision.
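
As a minimal sketch of how the classification target could be framed, the `case_status` column can be mapped to a binary label. The sample rows are copied from the first five records of the dataset shown later in this notebook; the names `status_to_label` and `label`, and the choice of `Certified` as the positive class, are assumptions made here for illustration.

```python
import pandas as pd

# Toy sample: case_id and case_status of the first five records in the dataset.
sample = pd.DataFrame({
    "case_id": ["EZYV01", "EZYV02", "EZYV03", "EZYV04", "EZYV05"],
    "case_status": ["Denied", "Certified", "Denied", "Denied", "Certified"],
})

# Assumption: Certified is treated as the positive class (1), Denied as 0.
status_to_label = {"Certified": 1, "Denied": 0}
sample["label"] = sample["case_status"].map(status_to_label)

print(sample)
```

Any value outside the mapping would become `NaN`, which makes unexpected status values easy to spot before training.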

## 2) Data Collection.
* The dataset is from the Office of Foreign Labor Certification (OFLC).
* The data consists of 25,480 rows and 12 columns.

https://www.kaggle.com/datasets/moro23/easyvisa-dataset

### 2.1 Import Data and Required Packages

**Importing Pandas, NumPy, Matplotlib, Seaborn, Plotly and the warnings library.**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

In [5]:
df = pd.read_csv('Visadataset.csv')

In [7]:
df.head() #Showing top 5 rows

| | case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EZYV01 | Asia | High School | N | N | 14513 | 2007 | West | 592.2029 | Hour | Y | Denied |
| 1 | EZYV02 | Asia | Master's | Y | N | 2412 | 2002 | Northeast | 83425.65 | Year | Y | Certified |
| 2 | EZYV03 | Asia | Bachelor's | N | Y | 44444 | 2008 | West | 122996.86 | Year | Y | Denied |
| 3 | EZYV04 | Asia | Bachelor's | N | N | 98 | 1897 | West | 83434.03 | Year | Y | Denied |
| 4 | EZYV05 | Africa | Master's | Y | N | 1082 | 2005 | South | 149907.39 | Year | Y | Certified |


In [8]:
#Shape of the dataset
df.shape

(25480, 12)

In [11]:
#Summary of the data
df.describe().style.background_gradient(cmap='viridis', axis=None)

| | no_of_employees | yr_of_estab | prevailing_wage |
|---|---|---|---|
| count | 25480.0 | 25480.0 | 25480.0 |
| mean | 5667.04321 | 1979.409929 | 74455.814592 |
| std | 22877.928848 | 42.366929 | 52815.942327 |
| min | -26.0 | 1800.0 | 2.1367 |
| 25% | 1022.0 | 1976.0 | 34015.48 |
| 50% | 2109.0 | 1997.0 | 70308.21 |
| 75% | 3504.0 | 2005.0 | 107735.5125 |
| max | 602069.0 | 2016.0 | 319210.27 |
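
The summary above reports a minimum of -26 for `no_of_employees`, which cannot be a valid head count. One possible way to flag such rows is sketched below on a toy frame; the values are illustrative (the -26 mirrors the reported minimum, the rest come from the first rows of the dataset), and `toy` and `invalid_mask` are hypothetical names. On the real data the same mask could be built from `df`.

```python
import pandas as pd

# Toy frame: illustrative values only, not a slice of the real dataset.
toy = pd.DataFrame({"no_of_employees": [14513, 2412, -26, 98, 1082]})

# A head count cannot be negative, so mark those rows for review.
invalid_mask = toy["no_of_employees"] < 0

print(f"Rows with a negative employee count: {invalid_mask.sum()}")
print(toy[invalid_mask])
```

Whether such rows should be dropped, corrected, or converted to absolute values is a separate data-cleaning decision that this sketch does not make.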


In [12]:
#Checking the datatype 
df.dtypes

case_id                   object
continent                 object
education_of_employee     object
has_job_experience        object
requires_job_training     object
no_of_employees            int64
yr_of_estab                int64
region_of_employment      object
prevailing_wage          float64
unit_of_wage              object
full_time_position        object
case_status               object
dtype: object

### *Exploring the Data*

In [32]:
categorical_features = df.select_dtypes('O').columns
numeric_features = [i for i in df.columns if i not in categorical_features]
print(
    f"We have {len(categorical_features)} categorical features-{list(categorical_features)}\n"
    f"and\n"
    f"{len(numeric_features)} numeric features-{list(numeric_features)}"
)

We have 9 categorical features-['case_id', 'continent', 'education_of_employee', 'has_job_experience', 'requires_job_training', 'region_of_employment', 'unit_of_wage', 'full_time_position', 'case_status']
and
3 numeric features-['no_of_employees', 'yr_of_estab', 'prevailing_wage']
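
The dtype-based split counts `case_id` as a categorical feature, although it is an identifier rather than a category. A rough sketch of one way to spot identifier-like columns is to compare each column's number of unique values with the number of rows. The toy frame below reuses values from the first five records; `toy`, `unique_counts` and `id_like_columns` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy frame built from the first five records of the dataset.
toy = pd.DataFrame({
    "case_id": ["EZYV01", "EZYV02", "EZYV03", "EZYV04", "EZYV05"],
    "continent": ["Asia", "Asia", "Asia", "Asia", "Africa"],
    "unit_of_wage": ["Hour", "Year", "Year", "Year", "Year"],
})

# A column with as many unique values as rows behaves like an identifier.
unique_counts = toy.nunique()
id_like_columns = unique_counts[unique_counts == len(toy)].index.tolist()

print(id_like_columns)
```

Columns flagged this way carry no grouping information, so they are usually left out of category-level summaries and model features.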


In [37]:

# Proportion of each category in every categorical column
for i in categorical_features:
    print(df[i].value_counts(normalize=True))
    print("----------------------")

EZYV01       0.000039
EZYV16995    0.000039
EZYV16993    0.000039
EZYV16992    0.000039
EZYV16991    0.000039
               ...   
EZYV8492     0.000039
EZYV8491     0.000039
EZYV8490     0.000039
EZYV8489     0.000039
EZYV25480    0.000039
Name: case_id, Length: 25480, dtype: float64
----------------------
Asia             0.661735
Europe           0.146468
North America    0.129199
South America    0.033438
Africa           0.021625
Oceania          0.007535
Name: continent, dtype: float64
----------------------
Bachelor's     0.401648
Master's       0.378100
High School    0.134223
Doctorate      0.086028
Name: education_of_employee, dtype: float64
----------------------
Y    0.580926
N    0.419074
Name: has_job_experience, dtype: float64
----------------------
N    0.884027
Y    0.115973
Name: requires_job_training, dtype: float64
----------------------
Northeast    0.282378
South        0.275392
West         0.258477
Midwest      0.169035
Island       0.014717
Name: region_of_emp