# CONFIDENCE SCORE OF LISTINGS IN OUR DATABASE

* For this task, we are going to implement supervised machine learning with the aim of predicting which new listings are more likely to be alive based on their verification status before being pushed to the database.


* Some of the **facts** we are working with when it comes to this task include :

    1. Total number of verified listings in our database as of 16th May 2023 = **1199298**
    2. Most of the verified listings are from South Africa, Kenya and Ethiopia
    
    
* Based on the hypotheses above, this introduces a **bias** in our dataset which can be dealt with using the **resampling techniques** i.e. oversampling and undersampling before choosing which technique gives us the best accuracy. 
    
    
* The **three verification statuses** for listings in our database include :

    1. 0 - Pending verification/Not verified
    2. 1 - Verified
    3. 2 - Rejected
    
    
* The **assumptions** we are working with in this case include :

    1. The more properties of a listing we have, the more likely the listing is to be verified.
    2. The verified listings in our database have a higher probability of being alive. 
    

## Importing the necessary libraries into our environment

In [566]:
# Pivot Table Package
# !pip install --upgrade pivottablejs
# !pip install category-encoders
# !pip install imblearn

In [567]:
# Importing the relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pivottablejs import pivot_ui 
from IPython.display import HTML
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from category_encoders import TargetEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.utils import resample
from sklearn.inspection import permutation_importance
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

## Reading the datasets

We start by compiling all the train data sets from the database and concatenating them into one file.

In [568]:
# Reading the datasets
# Concat all the files into one (Training Files)
csv_files = [r"C:\Users\derek\Downloads\supervised_training_data(1).csv", r"C:\Users\derek\Downloads\supervised_training_data(2).csv", r"C:\Users\derek\Downloads\supervised_training_data(3).csv", r"C:\Users\derek\Downloads\supervised_training_data(4).csv", r"C:\Users\derek\Downloads\supervised_training_data(5).csv", r"C:\Users\derek\Downloads\supervised_training_data(6).csv", r"C:\Users\derek\Downloads\supervised_training_data(7).csv", r"C:\Users\derek\Downloads\supervised_training_data(8).csv", r"C:\Users\derek\Downloads\supervised_training_data(9).csv",r"C:\Users\derek\Downloads\supervised_training_data(10).csv"]

The cell below is where we will be posting the datasets containing new listings.

In [569]:
# Read the files and preview the dataset
training_data = pd.concat([pd.read_csv(file) for file in csv_files], ignore_index=True)
print(training_data.shape)
training_data.head()

(1000000, 20)


Unnamed: 0,c.company_id,c.company_name_en,c.country,c.category_list,c.email,c.mobile,c.health_score,c.isp_provider,c.geo_code,c.has_contact_number,c.website,c.website_status,c.data_source,c.building_name,c.hours,c.confidence_indicator,c.latitude,c.longitude,c.is_headquarter,c.is_verified
0,"""c4ca4773-6a91-40ab-aeaf-a999dba845f2""","""Disc Digital Imaging Services""",,"""Printers,Security Printers,Screen Printers,Pr...",,,0.285714,,,,,,"""yellow_data""",,"[""2-6:08:00-17:00"", ""7-1:Closed""]",,,,,1.0
1,"""41385e2d-74d5-4cec-bd24-54f01a71d18b""","""Birhanu Tadese Getane""",,,,"[""+251911803485""]",0.285714,"[""Ethio Telecom""]",,1.0,,,"""Ministry of Trade Ethiopia""",,,,,,,0.0
2,"""89bf3430-e80c-449d-b485-49a503233c4b""","""Tarekegn Tadese Agemasie""",,,,,0.285714,,,1.0,,,"""Ministry of Trade Ethiopia""",,,,,,,0.0
3,"""b1b4b5cc-d5ce-4a6a-9d6f-f10aaa746537""","""Akovic Stores""","""Nigeria""","""Importers & Exporters""",,"[""+2348106479756""]",0.571429,"[""MTN""]","point({srid:4326, x:8.6181595, y:7.7716393})",1.0,"[""https://bit.ly""]","""inactive""","""GMB""",,,,7.771639,8.61816,,0.0
4,"""8841ee02-eb35-4aa1-9685-7761084cdca3""","""Spectrum Distributors""","""South Africa""","""Screen Printing Equipment & Supplies,Printing...",,,0.428571,,,1.0,,,"""business_list""",,,,,,,1.0


In [570]:
# Lets create a dataset that will contain the original dataset which we will use to retrieve the company names using the index. 
final = training_data
print(final.shape)

(1000000, 20)


## Data Cleaning

**1. First step is to deal with duplicates in our dataset. We drop them so as to maintain integrity in our data.**

In [571]:
# Check for duplicates on overall dataset
training_data.duplicated().sum()

52002

In [572]:
# Check for duplicates on the company_ID column
training_data['c.company_id'].duplicated().sum()

52002

In [573]:
# Drop the duplicated records from the original dataset
training_data = training_data.drop_duplicates(subset = 'c.company_id', keep='first')
print(training_data.shape)

# Keep the first duplicate based on the company name, phone, email and country
columns = ['c.company_name_en','c.category_list','c.country','c.email'] # Location_list, geocodes, 
training_data = training_data.drop_duplicates(subset = columns, keep='first')
print(training_data.shape)

(947998, 20)
(864439, 20)


**2. We then deal with null values in our columns. In this case we can impute the nulls with 0 or drop altogether for purposes of analysis and modelling.**

In [574]:
# We can view the distribution of null values as well as data types of columns in our dataset
training_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 864439 entries, 0 to 999999
Data columns (total 20 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   c.company_id            864439 non-null  object 
 1   c.company_name_en       864438 non-null  object 
 2   c.country               536502 non-null  object 
 3   c.category_list         405426 non-null  object 
 4   c.email                 109656 non-null  object 
 5   c.mobile                421798 non-null  object 
 6   c.health_score          864119 non-null  float64
 7   c.isp_provider          447078 non-null  object 
 8   c.geo_code              325016 non-null  object 
 9   c.has_contact_number    511889 non-null  float64
 10  c.website               160918 non-null  object 
 11  c.website_status        122243 non-null  object 
 12  c.data_source           457087 non-null  object 
 13  c.building_name         18187 non-null   object 
 14  c.hours             

In [575]:
# We can drop the columns with the highest percentage of null values as well as unnecessary fields
training_data.drop(['c.company_id', 'c.building_name', 'c.is_headquarter', 'c.confidence_indicator'], axis=1, inplace = True)

In [576]:
# Impute the nulls with 0
training_data['c.has_contact_number'].fillna(0, inplace=True)

## Data Exploration

In [577]:
# Verification status 
training_data['c.is_verified'].value_counts()

0.0    731993
1.0    132413
2.0        31
Name: c.is_verified, dtype: int64

**This is a good training data set because it accurately represents what we have in the database i.e. verified listings are about 15% of the database.**

In [578]:
# Website Status
training_data['c.website_status'].unique()

array([nan, '"inactive"', '"active"'], dtype=object)

In [579]:
# Category List
training_data['c.category_list'].unique()

array(['"Printers,Security Printers,Screen Printers,Printers Packaging,Printers - Screen"',
       nan, '"Importers & Exporters"', ..., '"Banks,Motoring Services"',
       '"Cartons,Construction,Partitions"',
       '"Videography,Chemists-Dispensing,Health Care/Personal Care and Social Assistance"'],
      dtype=object)

In [580]:
# Health Score
training_data['c.health_score'].unique()

array([0.28571429, 0.57142857, 0.42857143, 0.14285714, 0.71428571,
              nan])

In [581]:
# Country by verified listings
country = pd.pivot_table(training_data, index='c.country', values='c.company_name_en', columns='c.is_verified', aggfunc='count')
country.fillna(0, inplace=True)
country

c.is_verified,0.0,1.0,2.0
c.country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"""Angola""",1685.0,0.0,0.0
"""Australia""",2.0,0.0,0.0
"""Belgium""",1.0,0.0,0.0
"""Benin""",145.0,0.0,0.0
"""Botswana""",10107.0,0.0,0.0
"""Burkina Faso""",27.0,0.0,0.0
"""Burundi""",8.0,0.0,0.0
"""Cameroon""",43.0,0.0,0.0
"""Chad""",8.0,0.0,0.0
"""Congo""",2.0,0.0,0.0


**This confirms our earlier hypothesis which was that most verified listings are from the following countries i.e. Kenya, South Africa, Ethiopia, Nigeria, Uganda and Ghana.**

In [582]:
# Data Source
training_data['c.data_source'].value_counts()

"Ministry of Trade Ethiopia"    222264
"GMB"                            61837
"Addis Ababa Business List"      49770
"yellow_data"                    40349
"business_list"                  37253
                                 ...  
"ngo_board_kenya"                    1
"15024749"                           1
"Namibia2034"                        1
"NigeriaGolf Clubs"                  1
"11081709"                           1
Name: c.data_source, Length: 130, dtype: int64

In [583]:
# Remove column values that are represented just once in the dataset
# Initial training dataset shape
print(training_data.shape)

# Value Counts
country_counts = training_data['c.country'].value_counts()
category_counts = training_data['c.category_list'].value_counts()
isp_counts = training_data['c.isp_provider'].value_counts()
data_counts = training_data['c.data_source'].value_counts()

# Get the column values that appear less than twice
countries_to_delete = country_counts[country_counts < 10].index
categories_to_delete = category_counts[category_counts < 10].index
isp_to_delete = isp_counts[isp_counts < 2].index
data_to_delete = data_counts[data_counts < 2].index

# Delete rows with countries or categories appearing less than ten times
training_data_1 = training_data[~training_data['c.country'].isin(countries_to_delete)]
training_data_2 = training_data_1[~training_data_1['c.category_list'].isin(categories_to_delete)]
training_data_3 = training_data_2[~training_data_2['c.isp_provider'].isin(isp_to_delete)]
training_data = training_data_3[~training_data_3['c.data_source'].isin(data_to_delete)]

print(training_data.shape)

(864439, 16)
(854339, 16)


In [584]:
# Explore the different metrics in our dataset using PivottableJS
# pivot_ui(training_data, outfile_path='pivottablejs.html')
# HTML('pivottablejs.html')

## Feature Engineering

In [585]:
# Has website, contacts, category, country, working hours, in building
training_data['c.has_website'] = np.where(training_data['c.website'].isnull(),'No','Yes')
training_data['c.has_geocode'] = np.where(training_data['c.geo_code'].isnull() | training_data['c.latitude'].isnull() | training_data['c.longitude'].isnull(),'No','Yes')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  training_data['c.has_website'] = np.where(training_data['c.website'].isnull(),'No','Yes')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  training_data['c.has_geocode'] = np.where(training_data['c.geo_code'].isnull() | training_data['c.latitude'].isnull() | training_data['c.longitude'].isnull(),'No','Yes')


In [586]:
# Extracting the domain from email addresses
training_data['c.email'].unique()

array([nan, '["info@bodymindfitness.co.za"]', '["info@parklift.co.za"]',
       ..., '["karen@ckfisio.co.za", "colleen@ckfisio.co.za"]',
       '["lamacioltd@gmail.com"]', '["dunrite@mweb.co.za"]'], dtype=object)

## Splitting the dataset into train and test sets

* First we will fill all nulls in our dataset with 0.

In [587]:
# Fill any null values with 0
training_data.fillna(0, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  training_data.fillna(0, inplace=True)


* Then we separate the feature and target variables.

In [588]:
# Separate feature and target variables
cat_cols = ['c.country', 'c.category_list', 'c.isp_provider', 'c.health_score', 'c.has_contact_number', 'c.website', 'c.website_status', 'c.data_source','c.has_geocode']
target = ['c.is_verified']

* Afterwards we label encode the feature variables before splitting into our train and test sets. 

In [589]:
# Label encode categorical features
for col in cat_cols:
    training_data[col] = training_data[col].astype('category')
    training_data[col] = training_data[col].cat.codes

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  training_data[col] = training_data[col].astype('category')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  training_data[col] = training_data[col].cat.codes
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  training_data[col] = training_data[col].astype('category')
A value is trying to be set on a copy

In [590]:
# Set our target and feature variables
X = training_data[cat_cols]
y = training_data[target]

In [591]:
# Step 3: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Feature Importance

**In order to calculate the most important features in our dataset, we can use a RandomForestClassifier to get a sorted list of the important features which will be used in predicting the verification status of our listings. Afterwards, we can also use the permutation importance feature just for comparison on our most important features** 

In [592]:
# Initiate a random forest classifier before fitting to the train set
model = RandomForestClassifier()
model.fit(X_train, y_train)

  model.fit(X_train, y_train)


RandomForestClassifier()

In [593]:
# Get feature importance scores
importance_scores = model.feature_importances_

# Get feature names
feature_names = X.columns  # feature variables

# Create a dictionary to store feature importance scores with their corresponding feature names
feature_importance_dict = dict(zip(feature_names, importance_scores))

# Sort the feature importance dictionary in descending order based on the importance scores
sorted_feature_importance = sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True)

# Print the sorted feature importance in descending order
for feature, importance_score in sorted_feature_importance:
    print(f"{feature}: {importance_score:.4f}")

c.data_source: 0.4881
c.health_score: 0.2137
c.isp_provider: 0.0769
c.country: 0.0681
c.has_geocode: 0.0442
c.website: 0.0327
c.category_list: 0.0306
c.website_status: 0.0262
c.has_contact_number: 0.0194


In [594]:
# Calculate feature importance using permutation importance
results = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

importance = results.importances_mean
importance_std = results.importances_std
feature_names = X.columns

# Print feature importance
for feature, importance, std in zip(feature_names, importance, importance_std):
    print(f"{feature}: {importance:.4f} +/- {std:.4f}")

c.country: 0.0272 +/- 0.0003
c.category_list: 0.0042 +/- 0.0001
c.isp_provider: 0.0445 +/- 0.0002
c.health_score: 0.1236 +/- 0.0005
c.has_contact_number: 0.0019 +/- 0.0001
c.website: 0.0085 +/- 0.0001
c.website_status: 0.0024 +/- 0.0001
c.data_source: 0.1231 +/- 0.0003
c.has_geocode: 0.0568 +/- 0.0002


## Dealing with the class imbalance in our 'is_verified' column

* Considering that majority of the listings are yet to be verified (0), this means that we will need to resample our train set so as to avoid inaccuracies in our predictions. In our case, we can perform both oversampling and undersampling to compare which technique gives us the most accurate results.

In [595]:
# Initiate the oversampling and undersampling classes
oversampler = RandomOverSampler(random_state=42)
undersampler = RandomUnderSampler(random_state=42)

X_train_oversampled, y_train_oversampled = oversampler.fit_resample(X_train, y_train)  # Oversample the minority class
X_train_undersampled, y_train_undersampled = undersampler.fit_resample(X_train, y_train) # Undersample the majority class

## Model Selection and Training

**1. Logistic Regression**

In [596]:
# Train the model
model = LogisticRegression()
model.fit(X_train_undersampled, y_train_undersampled)

# Evaluate the model using the classification report
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


              precision    recall  f1-score   support

         0.0       0.92      0.27      0.42    218161
         1.0       0.19      0.85      0.31     38135
         2.0       0.00      0.83      0.00         6

    accuracy                           0.36    256302
   macro avg       0.37      0.65      0.24    256302
weighted avg       0.81      0.36      0.40    256302



**2. Random Forest Classifier**

In [597]:
# Create a RandomForestClassifier instance
rf = RandomForestClassifier(max_depth=2, n_estimators=100)

 # Fit to the train data
rf.fit(X_train_undersampled, y_train_undersampled)

# Evaluate the model using a classification report
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))

  rf.fit(X_train_undersampled, y_train_undersampled)


              precision    recall  f1-score   support

         0.0       0.99      0.70      0.82    218161
         1.0       0.38      0.93      0.54     38135
         2.0       0.00      0.83      0.00         6

    accuracy                           0.74    256302
   macro avg       0.46      0.82      0.45    256302
weighted avg       0.90      0.74      0.78    256302



In [598]:
# Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 5, 10],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],  # Minimum number of samples required to be at a leaf node
    'max_features': ['sqrt', 'log2']  # Number of features to consider when looking for the best split
}

# Create a RandomForestClassifier instance
rf = RandomForestClassifier()

# Perform hyperparameter tuning using RandomizedSearchCV
rf_random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=10, cv=5, scoring='recall', random_state=42)
rf_random_search.fit(X_train_undersampled, y_train_undersampled)

# Print the best hyperparameters and the best score
print("Best Hyperparameters:", rf_random_search.best_params_)
print("Best Score:", rf_random_search.best_score_)

# Get the best model from RandomizedSearchCV
best_rf_model = rf_random_search.best_estimator_

# Make predictions on the test set using the best model
y_pred = best_rf_model.predict(X_test)

# Evaluate the model performance
print(classification_report(y_test, y_pred))

  estimator.fit(X_train, y_train, **fit_params)
Traceback (most recent call last):
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 216, in __call__
    return self._score(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1901, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classificati

Traceback (most recent call last):
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 216, in __call__
    return self._score(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1901, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1365, in _check_set_wise_labels
   

Traceback (most recent call last):
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 216, in __call__
    return self._score(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1901, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1365, in _check_set_wise_labels
   

Traceback (most recent call last):
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 216, in __call__
    return self._score(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1901, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1365, in _check_set_wise_labels
   

Traceback (most recent call last):
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 216, in __call__
    return self._score(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1901, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1365, in _check_set_wise_labels
   

Traceback (most recent call last):
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 216, in __call__
    return self._score(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1901, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1365, in _check_set_wise_labels
   

Traceback (most recent call last):
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 216, in __call__
    return self._score(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1901, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1365, in _check_set_wise_labels
   

Traceback (most recent call last):
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 216, in __call__
    return self._score(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1901, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1365, in _check_set_wise_labels
   

Traceback (most recent call last):
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 216, in __call__
    return self._score(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1901, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1365, in _check_set_wise_labels
   

Traceback (most recent call last):
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 216, in __call__
    return self._score(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1901, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
  File "C:\Users\derek\anaconda3\lib\site-packages\sklearn\metrics\_classification.py", line 1365, in _check_set_wise_labels
   

Best Hyperparameters: {'n_estimators': 300, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_features': 'log2', 'max_depth': 10}
Best Score: nan
              precision    recall  f1-score   support

         0.0       0.99      0.84      0.91    218161
         1.0       0.57      0.92      0.70     38135
         2.0       0.00      0.83      0.00         6

    accuracy                           0.86    256302
   macro avg       0.52      0.87      0.54    256302
weighted avg       0.93      0.86      0.88    256302



* **In the case above, we will use the recall score as our metric of evaluation because we want to focus on the high quality listings which we'd need to be correctly verified (1) before proceeding to treat them as leads.** 


* **Keeping this in mind, we will settle with the Random Forest Classifier as it gives us a recall score of 92% for the listings with the verification status 1.**

## Model Evaluation

In [631]:
# Reading the csv files into our environment
csv_files_1=[r"C:\Users\derek\Downloads\supervised_test_data(1).csv", r"C:\Users\derek\Downloads\supervised_test_data(2).csv", r"C:\Users\derek\Downloads\supervised_test_data(3).csv", r"C:\Users\derek\Downloads\supervised_test_data(4).csv", r"C:\Users\derek\Downloads\supervised_test_data(5).csv"]

In [632]:
# Read the files and preview the dataset
test_data = pd.concat([pd.read_csv(file) for file in csv_files_1], ignore_index=True)
print(test_data.shape)

(500000, 20)


In [633]:
# Drop the duplicated records from the original dataset
test_data = test_data.drop_duplicates(subset = 'c.company_id', keep='first')
print(test_data.shape)

# Keep the first duplicate based on the company name, phone, email and country
columns = ['c.company_name_en','c.category_list','c.country','c.email'] # Location_list, geocodes, 
test_data = test_data.drop_duplicates(subset = columns, keep='first')
print(test_data.shape)

# We can drop the columns with the highest percentage of null values as well as unnecessary fields
test_data.drop(['c.company_id', 'c.building_name', 'c.is_headquarter', 'c.confidence_indicator'], axis=1, inplace = True)

(488223, 20)
(453655, 20)


In [634]:
# Impute the nulls with 0
test_data['c.has_contact_number'].fillna(0, inplace=True)

# Verification status 
print(test_data['c.is_verified'].value_counts())

0    384768
1     68869
2        18
Name: c.is_verified, dtype: int64


In [635]:
# Has website, contacts, category, country, working hours, in building
test_data['c.has_website'] = np.where(test_data['c.website'].isnull(),'No','Yes')
test_data['c.has_geocode'] = np.where(test_data['c.geo_code'].isnull() | test_data['c.latitude'].isnull() | test_data['c.longitude'].isnull(),'No','Yes')

# Fill any null values with 0
test_data.fillna(0, inplace=True)

In [636]:
# Separate feature and target variables
cat_cols = ['c.country', 'c.category_list', 'c.isp_provider', 'c.health_score', 'c.has_contact_number', 'c.website', 'c.website_status', 'c.data_source','c.has_geocode']
target = ['c.is_verified']

# Label encode categorical features
for col in cat_cols:
    test_data[col] = test_data[col].astype('category')
    test_data[col] = test_data[col].cat.codes
    
# Split between the feature and target variables    
test = test_data[cat_cols]
actual = test_data[target].values

In [638]:
# Predicting on the unseen data
prediction = best_rf_model.predict(test)
print(f"Predicted verification status for the new listing: {prediction}")

print(classification_report(prediction, actual))

Predicted verification status for the new listing: [1. 0. 0. ... 1. 0. 0.]
              precision    recall  f1-score   support

         0.0       0.83      0.99      0.90    321581
         1.0       0.95      0.54      0.69    120338
         2.0       0.67      0.00      0.00     11736

    accuracy                           0.85    453655
   macro avg       0.82      0.51      0.53    453655
weighted avg       0.86      0.85      0.82    453655

