

----

# **`MongoDB Practice`**

### **Author** : **Muhammad Adil Naeem**
### **Contact** : **madilnaeem0@gmail.com**
 
------

### **Dataset Link**

[EasyVisa Dataset](https://www.kaggle.com/datasets/moro23/easyvisa-dataset)

- The dataset contains data on US Visa applications for foreign employees.

### **Import Libraries**

In [36]:
import os
import pymongo
import pandas as pd

### **Load Dataset**

In [37]:
df = pd.read_csv(r"E:\MLOPs-US-Visa-Approval-Prediction-Project\notebooks\dataset\EasyVisa.csv")
df.head()

| | case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EZYV01 | Asia | High School | N | N | 14513 | 2007 | West | 592.2029 | Hour | Y | Denied |
| 1 | EZYV02 | Asia | Master's | Y | N | 2412 | 2002 | Northeast | 83425.65 | Year | Y | Certified |
| 2 | EZYV03 | Asia | Bachelor's | N | Y | 44444 | 2008 | West | 122996.86 | Year | Y | Denied |
| 3 | EZYV04 | Asia | Bachelor's | N | N | 98 | 1897 | West | 83434.03 | Year | Y | Denied |
| 4 | EZYV05 | Africa | Master's | Y | N | 1082 | 2005 | South | 149907.39 | Year | Y | Certified |


### **Data Shape**

In [38]:
print(f"This Dataset has {df.shape[0]} rows and {df.shape[1]} columns.")

This Dataset has 25480 rows and 12 columns.


### **Data Information**

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   case_id                25480 non-null  object 
 1   continent              25480 non-null  object 
 2   education_of_employee  25480 non-null  object 
 3   has_job_experience     25480 non-null  object 
 4   requires_job_training  25480 non-null  object 
 5   no_of_employees        25480 non-null  int64  
 6   yr_of_estab            25480 non-null  int64  
 7   region_of_employment   25480 non-null  object 
 8   prevailing_wage        25480 non-null  float64
 9   unit_of_wage           25480 non-null  object 
 10  full_time_position     25480 non-null  object 
 11  case_status            25480 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 2.3+ MB


### **Let's Convert Our CSV Data to Dictionary Format**

- MongoDB stores data as documents made of key-value pairs, so we convert each row of the DataFrame into a dictionary.

In [40]:
# Convert the DataFrame 'df' into a list of dictionaries.
# Each dictionary corresponds to a row in the DataFrame.
# The 'orient' parameter specifies the format of the output:
# - "records" means that each row will be represented as a separate dictionary,
#   where the keys are the column names and the values are the corresponding row values.
data = df.to_dict(orient="records")

In [41]:
# Uncomment the next line to print the full list of dictionaries
# data
# Check how many records were created
len(data)

25480
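
To make the `records` orientation concrete, here is a minimal sketch on a two-row frame built from the first rows shown above. The `sample` frame and its three-column subset are chosen here purely for illustration.

```python
import pandas as pd

# Two rows copied from the head() output above, restricted to three columns
sample = pd.DataFrame(
    {
        "case_id": ["EZYV01", "EZYV02"],
        "continent": ["Asia", "Asia"],
        "no_of_employees": [14513, 2412],
    }
)

# orient="records" yields one dictionary per row, keyed by column name
sample_records = sample.to_dict(orient="records")

for record in sample_records:
    print(record)
```

Each of these dictionaries becomes one MongoDB document when the list is passed to `insert_many`.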

### **Set Up MongoDB Connection**

- Inside the MongoDB cluster we create a database, and inside that database a collection that will hold our dictionary data.

In [None]:
DB_NAME = "US_VISA_PREDICTION"
COLLECTION_NAME = "US_VISA_DATA"
CONNECTION_URL = "mongodb+srv://<username>:<password>@cluster0.3zupq.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"
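
Hard-coding credentials in a notebook makes them easy to leak. One possible alternative, sketched below, reads them from environment variables and percent-encodes them with `urllib.parse.quote_plus`, since special characters such as `@` or `:` in a username or password must be escaped in a MongoDB URI. The variable names `MONGODB_USERNAME` and `MONGODB_PASSWORD` and the helpers `build_connection_url` and `connection_url_from_env` are assumptions made for this example, not part of the original notebook.

```python
import os
from urllib.parse import quote_plus

# Host taken from the connection string above
CLUSTER_HOST = "cluster0.3zupq.mongodb.net"


def build_connection_url(username: str, password: str, host: str = CLUSTER_HOST) -> str:
    """Build a mongodb+srv URI with percent-encoded credentials."""
    user = quote_plus(username)
    secret = quote_plus(password)
    return (
        f"mongodb+srv://{user}:{secret}@{host}/"
        "?retryWrites=true&w=majority&appName=Cluster0"
    )


def connection_url_from_env() -> str:
    """Read credentials from the environment; raises KeyError if either is unset."""
    # Hypothetical variable names, chosen for this sketch
    return build_connection_url(
        os.environ["MONGODB_USERNAME"],
        os.environ["MONGODB_PASSWORD"],
    )


# Usage, once both environment variables are set:
# CONNECTION_URL = connection_url_from_env()
```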

### **Set Up MongoDB Collection**

In [43]:
# Initialize a client connection to MongoDB using the specified connection URL
client = pymongo.MongoClient(CONNECTION_URL)

# Access the specified database within MongoDB
data_base = client[DB_NAME]

# Access the specified collection within the database
collection = data_base[COLLECTION_NAME]

# Insert multiple documents into the collection
result = collection.insert_many(data)
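
Note that `insert_many` only appends: running the cell again with a freshly built list stores a second copy of every row. A minimal sketch of one way to keep the load repeatable is shown below. The helper name `reload_collection` is an assumption for this example, and it deliberately deletes every existing document first, so it only suits a collection that holds nothing but this dataset.

```python
def reload_collection(collection, records):
    """Replace the contents of `collection` with `records`; return the number inserted.

    `collection` is expected to be a pymongo Collection (or any object that
    exposes the same delete_many / insert_many methods).
    """
    # Remove every existing document so repeated runs do not accumulate copies
    collection.delete_many({})

    # insert_many rejects an empty list, so return early in that case
    if not records:
        return 0

    result = collection.insert_many(records)
    return len(result.inserted_ids)


# Usage with the objects defined in this notebook:
# inserted = reload_collection(collection, data)
```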


### **Fetch Data from MongoDB**

In [44]:
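# find() with no filter returns a cursor over every document in the collection;
# the documents are only fetched once the cursor is iterated
records = collection.find()

# Evaluating the cursor displays the cursor object, not the documents
records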


<pymongo.synchronous.cursor.Cursor at 0x1ac334a7940>

In [45]:
# Uncomment this loop to view all records:
# for i, j in enumerate(records):
#     # Print the index 'i' and the record 'j' in a formatted string
#     print(f"{i} - {j}")

# View the first 5 records
for i, j in enumerate(records[:5]):
    print(f"{i} - {j}")

0 - {'_id': ObjectId('6724c1167f06b6d9fd0ce1a3'), 'case_id': 'EZYV01', 'continent': 'Asia', 'education_of_employee': 'High School', 'has_job_experience': 'N', 'requires_job_training': 'N', 'no_of_employees': 14513, 'yr_of_estab': 2007, 'region_of_employment': 'West', 'prevailing_wage': 592.2029, 'unit_of_wage': 'Hour', 'full_time_position': 'Y', 'case_status': 'Denied'}
1 - {'_id': ObjectId('6724c1167f06b6d9fd0ce1a4'), 'case_id': 'EZYV02', 'continent': 'Asia', 'education_of_employee': "Master's", 'has_job_experience': 'Y', 'requires_job_training': 'N', 'no_of_employees': 2412, 'yr_of_estab': 2002, 'region_of_employment': 'Northeast', 'prevailing_wage': 83425.65, 'unit_of_wage': 'Year', 'full_time_position': 'Y', 'case_status': 'Certified'}
2 - {'_id': ObjectId('6724c1167f06b6d9fd0ce1a5'), 'case_id': 'EZYV03', 'continent': 'Asia', 'education_of_employee': "Bachelor's", 'has_job_experience': 'N', 'requires_job_training': 'Y', 'no_of_employees': 44444, 'yr_of_estab': 2008, 'region_of_empl
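
Fetching every document is rarely needed just to look at the data. As a rough sketch, `find` also accepts a filter and a projection, and the returned cursor can be capped with `limit`. The helper name `preview_documents` is an assumption for this example; the field names in the usage lines come from the dataset above.

```python
def preview_documents(collection, query=None, limit=5):
    """Return up to `limit` documents matching `query`, without the `_id` field."""
    # An empty filter matches every document; {"_id": 0} excludes the ObjectId
    cursor = collection.find(query or {}, {"_id": 0}).limit(limit)
    # Materialise the cursor so the result can be inspected or reused
    return list(cursor)


# Usage with the collection defined above:
# preview_documents(collection, {"case_status": "Certified"})
# preview_documents(collection, {"continent": "Asia", "unit_of_wage": "Year"}, limit=3)
```

Returning a list rather than the cursor is a deliberate choice here: a cursor is consumed as it is iterated, whereas a list can be inspected repeatedly.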

### **Let's Convert This Data into a DataFrame**

In [None]:
# Convert the documents from the MongoDB collection into a DataFrame
# 'list(collection.find())' fetches all documents from the collection and converts them to a list of dictionaries
df1 = pd.DataFrame(list(collection.find()))

# Display the first five rows of the DataFrame
df1.head()


| | _id | case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6724c1167f06b6d9fd0ce1a3 | EZYV01 | Asia | High School | N | N | 14513 | 2007 | West | 592.2029 | Hour | Y | Denied |
| 1 | 6724c1167f06b6d9fd0ce1a4 | EZYV02 | Asia | Master's | Y | N | 2412 | 2002 | Northeast | 83425.65 | Year | Y | Certified |
| 2 | 6724c1167f06b6d9fd0ce1a5 | EZYV03 | Asia | Bachelor's | N | Y | 44444 | 2008 | West | 122996.86 | Year | Y | Denied |
| 3 | 6724c1167f06b6d9fd0ce1a6 | EZYV04 | Asia | Bachelor's | N | N | 98 | 1897 | West | 83434.03 | Year | Y | Denied |
| 4 | 6724c1167f06b6d9fd0ce1a7 | EZYV05 | Africa | Master's | Y | N | 1082 | 2005 | South | 149907.39 | Year | Y | Certified |
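
The round-tripped frame carries an extra `_id` column that MongoDB added on insert. If the goal is to get back the original twelve columns, one option is to drop it after building the frame, as in this minimal sketch. The helper name `documents_to_frame` is an assumption for this example.

```python
import pandas as pd


def documents_to_frame(documents):
    """Build a DataFrame from MongoDB documents and drop the `_id` column."""
    frame = pd.DataFrame(list(documents))
    # errors="ignore" avoids a KeyError if `_id` is absent,
    # e.g. when it was already excluded by a projection
    return frame.drop(columns=["_id"], errors="ignore")


# Usage with the collection defined above:
# df1 = documents_to_frame(collection.find())
# df1.columns  # the `_id` column is no longer present
```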


-------