**Program Goals:**

**a) Predict the crime rates for the data given in the "CRPI-Customer.csv" file**

**b) Estimate the population of each city for the year of prediction**

**c) Calculate the "Total number of Crime cases"**

**d) Categorize the "Crime Status Level Area" using the predicted crime rate values**

**e) Write out an output file containing the customer data along with the values above**

**f) Write out a Pickle file of the trained model, which can be used by the frontend programs**

**1) Importing the required Python Libraries and packages**

In [43]:
# Importing the libraries
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score
from sklearn import model_selection
from sklearn import preprocessing
import warnings
warnings.filterwarnings("ignore")

**2) Read the Training Data file and check**

In [44]:
# Importing the Training CRPI dataset
train_df = pd.read_csv('D:/CRPI-Latest1/Data-Files/CRPI-Mod.csv')

In [45]:
train_df.shape

(1520, 5)

In [46]:
train_df.head()

Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Type,Crime Rate
0,2014,Ahmedabad,63.5,Murder,1.291339
1,2015,Ahmedabad,63.5,Murder,1.480315
2,2016,Ahmedabad,63.5,Murder,1.622047
3,2017,Ahmedabad,63.5,Murder,1.417323
4,2018,Ahmedabad,63.5,Murder,1.543307


In [47]:
train_df.tail()

Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Type,Crime Rate
1515,2017,Surat,45.8,Cyber Crimes,2.292576
1516,2018,Surat,45.8,Cyber Crimes,3.384279
1517,2019,Surat,45.8,Cyber Crimes,4.978166
1518,2020,Surat,45.8,Cyber Crimes,4.454148
1519,2021,Surat,45.8,Cyber Crimes,6.462882


In [48]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1520 entries, 0 to 1519
Data columns (total 5 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           1520 non-null   int64  
 1   City                           1520 non-null   object 
 2   Population (in Lakhs) (2011)+  1520 non-null   float64
 3   Type                           1520 non-null   object 
 4   Crime Rate                     1520 non-null   float64
dtypes: float64(2), int64(1), object(2)
memory usage: 59.5+ KB


In [49]:
train_df.isnull().sum()

Year                             0
City                             0
Population (in Lakhs) (2011)+    0
Type                             0
Crime Rate                       0
dtype: int64

**Note: There are no missing values in the given input data file**

In [50]:
train_df.shape

(1520, 5)

In [51]:
# Checking and removing duplicate records, if any
train_df.drop_duplicates(inplace = True)

In [52]:
train_df.shape

(1520, 5)

**Note: The number of rows is the same before and after running the duplicate-removal command. Hence there are no duplicate records in the given input file**
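
The before/after shape comparison above can also be done in one step by counting duplicated rows directly. The snippet below is a minimal sketch on a hypothetical toy frame (`toy_df` is not part of the project data), shown only to illustrate the check.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_df; the last row repeats the first.
toy_df = pd.DataFrame({
    "Year": [2014, 2015, 2014],
    "City": ["Ahmedabad", "Ahmedabad", "Ahmedabad"],
    "Crime Rate": [1.29, 1.48, 1.29],
})

# duplicated() flags every repeat of an earlier row, so the sum is the
# number of rows that drop_duplicates() would remove.
n_duplicates = int(toy_df.duplicated().sum())
deduped_df = toy_df.drop_duplicates()
print(n_duplicates, len(toy_df), len(deduped_df))
```

A count of zero corresponds to the situation reported in the note: nothing to remove.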

**3) Read the Customer Data file and check**

In [53]:
# Reading the CRPI-Customer dataset
cust_df = pd.read_csv('D:/CRPI-Latest1/Data-Files/CRPI-Customer.csv')

In [54]:
cust_df.shape

(190, 4)

In [55]:
cust_df.head()

Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Type
0,2025,Ahmedabad,63.5,Murder
1,2025,Bengaluru,85.0,Murder
2,2025,Chennai,87.0,Murder
3,2025,Coimbatore,21.5,Murder
4,2025,Delhi,163.1,Murder


In [56]:
cust_df.tail()

Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Type
185,2025,Patna,20.5,Cyber Crimes
186,2025,Pune,50.5,Cyber Crimes
187,2025,Surat,45.8,Cyber Crimes
188,2025,Chennai,87.0,Crime against ST
189,2025,Coimbatore,21.5,Crime against ST


In [57]:
cust_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190 entries, 0 to 189
Data columns (total 4 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           190 non-null    int64  
 1   City                           190 non-null    object 
 2   Population (in Lakhs) (2011)+  190 non-null    float64
 3   Type                           190 non-null    object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.1+ KB


In [58]:
cust_df.isnull().sum()

Year                             0
City                             0
Population (in Lakhs) (2011)+    0
Type                             0
dtype: int64

**Note: There are no missing values in the given input data file**

In [59]:
cust_df.shape

(190, 4)

In [60]:
# Remove all duplicates:
cust_df.drop_duplicates(inplace = True)

In [61]:
cust_df.shape

(190, 4)

**Note: The number of rows is the same before and after running the duplicate-removal command. Hence there are no duplicate records in the given input file**

**4) Combine Training and Customer Data file and check**

In [62]:
train_df['train']=1
cust_df['test'] = 0

In [63]:
combined_df  = pd.concat([train_df, cust_df])
combined_df.shape

(1710, 7)

In [64]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1710 entries, 0 to 189
Data columns (total 7 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           1710 non-null   int64  
 1   City                           1710 non-null   object 
 2   Population (in Lakhs) (2011)+  1710 non-null   float64
 3   Type                           1710 non-null   object 
 4   Crime Rate                     1520 non-null   float64
 5   train                          1520 non-null   float64
 6   test                           190 non-null    float64
dtypes: float64(4), int64(1), object(2)
memory usage: 106.9+ KB
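
In the output above, the "train" and "test" marker columns come back as float64 with missing values, because each marker exists in only one of the two frames. As an alternative sketch (not what this notebook does), a single boolean marker present in both frames avoids the gaps. The frames and the column name `is_train` below are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames standing in for train_df and cust_df.
train_toy = pd.DataFrame({"Year": [2014, 2015], "Crime Rate": [1.29, 1.48]})
cust_toy = pd.DataFrame({"Year": [2025]})

# One marker column present in both frames, so concat introduces no NaN in it.
combined_toy = pd.concat(
    [train_toy.assign(is_train=True), cust_toy.assign(is_train=False)],
    ignore_index=True,
)

# Splitting back is a plain boolean mask on the same column.
train_part = combined_toy[combined_toy["is_train"]].drop(columns="is_train")
cust_part = combined_toy[~combined_toy["is_train"]].drop(columns=["is_train", "Crime Rate"])
print(combined_toy)
```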


**Note: The "City" and "Type" columns hold text, so they must be converted to numerical values before the regression algorithm can read this dataframe without error.**

**5) Combined Data file Preprocessing**

In [65]:
le = preprocessing.LabelEncoder()
combined_df['City1'] = le.fit_transform(combined_df.City.values)
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1710 entries, 0 to 189
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           1710 non-null   int64  
 1   City                           1710 non-null   object 
 2   Population (in Lakhs) (2011)+  1710 non-null   float64
 3   Type                           1710 non-null   object 
 4   Crime Rate                     1520 non-null   float64
 5   train                          1520 non-null   float64
 6   test                           190 non-null    float64
 7   City1                          1710 non-null   int32  
dtypes: float64(4), int32(1), int64(1), object(2)
memory usage: 113.6+ KB


In [66]:
combined_df1 = combined_df.sort_values(by=["City","City1"])
Unique_City_City1 = combined_df1[['City', 'City1']].drop_duplicates()
print(Unique_City_City1)

           City  City1
0     Ahmedabad      0
8     Bengaluru      1
16      Chennai      2
24   Coimbatore      3
32        Delhi      4
40    Ghaziabad      5
48    Hyderabad      6
56       Indore      7
64       Jaipur      8
72       Kanpur      9
80        Kochi     10
88      Kolkata     11
96    Kozhikode     12
104     Lucknow     13
112      Mumbai     14
120      Nagpur     15
128       Patna     16
136        Pune     17
144       Surat     18


In [67]:
# Converting "Type" column based data from "Object" to "Integer"
le = preprocessing.LabelEncoder()
combined_df['Type1'] = le.fit_transform(combined_df.Type.values)
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1710 entries, 0 to 189
Data columns (total 9 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           1710 non-null   int64  
 1   City                           1710 non-null   object 
 2   Population (in Lakhs) (2011)+  1710 non-null   float64
 3   Type                           1710 non-null   object 
 4   Crime Rate                     1520 non-null   float64
 5   train                          1520 non-null   float64
 6   test                           190 non-null    float64
 7   City1                          1710 non-null   int32  
 8   Type1                          1710 non-null   int32  
dtypes: float64(4), int32(2), int64(1), object(2)
memory usage: 120.2+ KB


In [68]:
combined_df2 = combined_df.sort_values(by=["Type","Type1"])
Unique_Type_Type1 = combined_df2[['Type', 'Type1']].drop_duplicates()
print(Unique_Type_Type1)

                              Type  Type1
608   Crime Committed by Juveniles      0
912               Crime against SC      1
1064              Crime against ST      2
760   Crime against Senior Citizen      3
456         Crime against children      4
304            Crime against women      5
1368                  Cyber Crimes      6
1216             Economic Offences      7
152                     Kidnapping      8
0                           Murder      9
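
The label-to-code tables printed above can also be read straight from a fitted encoder, since `LabelEncoder` stores the sorted labels in `classes_` and the code of each label is its position there. A minimal sketch on a hypothetical short list (`le_demo` and `types` are illustrative names):

```python
from sklearn import preprocessing

# Hypothetical sample of the "Type" column.
types = ["Murder", "Kidnapping", "Cyber Crimes", "Murder"]

le_demo = preprocessing.LabelEncoder()
codes = le_demo.fit_transform(types)

# classes_ is sorted, and each label's integer code is its index in classes_.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform recovers the original text from a code.
decoded = le_demo.inverse_transform([2])[0]
```

Keeping one fitted encoder per column (rather than reusing the name `le`) would also let a frontend translate city and crime-type names into the codes the model expects.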


In [69]:
combined_df = combined_df.drop(['City', 'Type'], axis=1)
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1710 entries, 0 to 189
Data columns (total 7 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           1710 non-null   int64  
 1   Population (in Lakhs) (2011)+  1710 non-null   float64
 2   Crime Rate                     1520 non-null   float64
 3   train                          1520 non-null   float64
 4   test                           190 non-null    float64
 5   City1                          1710 non-null   int32  
 6   Type1                          1710 non-null   int32  
dtypes: float64(4), int32(2), int64(1)
memory usage: 93.5 KB


In [70]:
# Renaming columns
combined_df.rename(columns={'City1': 'City', 'Type1': 'Type'}, inplace=True)
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1710 entries, 0 to 189
Data columns (total 7 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           1710 non-null   int64  
 1   Population (in Lakhs) (2011)+  1710 non-null   float64
 2   Crime Rate                     1520 non-null   float64
 3   train                          1520 non-null   float64
 4   test                           190 non-null    float64
 5   City                           1710 non-null   int32  
 6   Type                           1710 non-null   int32  
dtypes: float64(4), int32(2), int64(1)
memory usage: 93.5 KB


In [71]:
combined_df.head()

Unnamed: 0,Year,Population (in Lakhs) (2011)+,Crime Rate,train,test,City,Type
0,2014,63.5,1.291339,1.0,,0,9
1,2015,63.5,1.480315,1.0,,0,9
2,2016,63.5,1.622047,1.0,,0,9
3,2017,63.5,1.417323,1.0,,0,9
4,2018,63.5,1.543307,1.0,,0,9


In [72]:
#Rearranging the "City" column
second_column = combined_df.pop('City') 
combined_df.insert(1, 'City', second_column) 
combined_df.head()

Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Crime Rate,train,test,Type
0,2014,0,63.5,1.291339,1.0,,9
1,2015,0,63.5,1.480315,1.0,,9
2,2016,0,63.5,1.622047,1.0,,9
3,2017,0,63.5,1.417323,1.0,,9
4,2018,0,63.5,1.543307,1.0,,9


**6) Segregation of the Training and Customer data after combined pre-processing**

In [73]:
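# Split the combined frame back into its two parts using the marker columns.
# drop() returns a new frame here, which avoids modifying a filtered slice in place.
train_df1 = combined_df[combined_df["train"] == 1].drop(["train", "test"], axis=1)
cust_df1 = combined_df[combined_df["test"] == 0].drop(["test", "train", "Crime Rate"], axis=1)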

In [74]:
train_df1.shape

(1520, 5)

In [75]:
train_df1.head()

Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Crime Rate,Type
0,2014,0,63.5,1.291339,9
1,2015,0,63.5,1.480315,9
2,2016,0,63.5,1.622047,9
3,2017,0,63.5,1.417323,9
4,2018,0,63.5,1.543307,9


In [76]:
cust_df1.shape

(190, 4)

In [77]:
cust_df1.head()

Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Type
0,2025,0,63.5,9
1,2025,1,85.0,9
2,2025,2,87.0,9
3,2025,3,21.5,9
4,2025,4,163.1,9


**7) Data slicing and data preparation for applying the finalized ML regression algorithm, DecisionTreeRegressor**

In [78]:
X = train_df1.drop(['Crime Rate'], axis = 1)
y = train_df1['Crime Rate']

In [79]:
X.head()

Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Type
0,2014,0,63.5,9
1,2015,0,63.5,9
2,2016,0,63.5,9
3,2017,0,63.5,9
4,2018,0,63.5,9


In [80]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 66)

In [81]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(1216, 4)
(1216,)
(304, 4)
(304,)


**8) Build the ML Regression model using the finalized regression algorithm "DecisionTreeRegressor" and evaluate the ML Model**

In [82]:
from sklearn.tree import DecisionTreeRegressor
# DecisionTreeRegressor has many parameters; only random_state is set here, so that results are reproducible.
DTR = DecisionTreeRegressor(random_state=0)
#Fit the regressor object to the dataset.
DTR.fit(X_train,y_train)

In [83]:
DTR_y_pred = DTR.predict(X_test)

In [84]:
# For a regressor, score() returns the R^2 coefficient of determination,
# so the "accuracy" values below are R^2 scores, not classification accuracy.
DTR_Training_Acc = DTR.score(X_train,y_train)
DTR_Testing_Acc = r2_score(y_test,DTR_y_pred)

In [85]:
print("Training Accuracy :", DTR_Training_Acc)
print("Testing Accuracy :", DTR_Testing_Acc)

Training Accuracy : 1.0
Testing Accuracy : 0.9329925803089038
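
The two values above are R^2 scores. A training score of 1.0 alongside a lower test score is typical of an unconstrained decision tree, which can memorise its training rows. `cross_val_score` is imported at the top of this notebook but never used; the sketch below shows how it could give a second view of generalisation. It runs on synthetic data (`X_demo`, `y_demo` and the other names are hypothetical), so its numbers say nothing about the CRPI model.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only to illustrate the calls.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 2 * X_demo[:, 0] + rng.normal(0, 1, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=66
)
tree = DecisionTreeRegressor(random_state=0).fit(Xd_train, yd_train)

# score() and r2_score() compute the same quantity for a regressor.
train_r2 = tree.score(Xd_train, yd_train)
test_r2 = r2_score(yd_test, tree.predict(Xd_test))

# 5-fold cross-validated R^2 on the full synthetic set.
cv_r2 = cross_val_score(
    DecisionTreeRegressor(random_state=0), X_demo, y_demo, cv=5, scoring="r2"
)
print(train_r2, test_r2, cv_r2.mean())
```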


In [86]:
DTR_Train_Prediction = DTR.predict(X_train)
print(DTR_Train_Prediction[:5])

[ 8.89250814 12.82352941  7.43678161  0.13100437 12.        ]


In [87]:
DTR_Training_Error = np.sqrt(mean_squared_error(y_train,DTR_Train_Prediction))
DTR_Training_Err=DTR_Training_Error.round(2)
DTR_Testing_Error = np.sqrt(mean_squared_error(y_test, DTR_y_pred))
DTR_Testing_Err=DTR_Testing_Error.round(2)

In [88]:
print("Training Error :", DTR_Training_Err)
print("Testing Error :", DTR_Testing_Err)

Training Error : 0.0
Testing Error : 5.52
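
The error reported above is the root mean squared error (RMSE): the square root of the mean of the squared differences between actual and predicted values, expressed in the same units as the crime rate. A small worked sketch on hypothetical numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values; only the last prediction is off (by 4).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 8.0])

# Manual RMSE: sqrt(mean((y - y_hat)^2)) = sqrt(16 / 4) = 2.0
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same value through scikit-learn, as done in the cell above.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a single large miss dominates the result, as the example shows.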


**9) Predict the Crime Rate values for the data given in Customer data file and add this info to the Customer Data file as a new column**

In [89]:
cust_df1.head()

Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Type
0,2025,0,63.5,9
1,2025,1,85.0,9
2,2025,2,87.0,9
3,2025,3,21.5,9
4,2025,4,163.1,9


In [90]:
# Predicting the Customer Dataset results
DTR_cust_pred = DTR.predict(cust_df1)

In [91]:
cust_df.head()

Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Type,test
0,2025,Ahmedabad,63.5,Murder,0
1,2025,Bengaluru,85.0,Murder,0
2,2025,Chennai,87.0,Murder,0
3,2025,Coimbatore,21.5,Murder,0
4,2025,Delhi,163.1,Murder,0


In [92]:
cust_df.drop(['test'], axis = 1, inplace=True)
cust_df.head()

Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Type
0,2025,Ahmedabad,63.5,Murder
1,2025,Bengaluru,85.0,Murder
2,2025,Chennai,87.0,Murder
3,2025,Coimbatore,21.5,Murder
4,2025,Delhi,163.1,Murder


In [93]:
cust_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190 entries, 0 to 189
Data columns (total 4 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Year                           190 non-null    int64  
 1   City                           190 non-null    object 
 2   Population (in Lakhs) (2011)+  190 non-null    float64
 3   Type                           190 non-null    object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.1+ KB


In [94]:
cust_df["Predicted_CRPI"]=DTR_cust_pred

**10) Estimating the Population of the City based on the Year for which the CRPI needs to be predicted**

In [95]:
year_diff = cust_df['Year'] - 2011
pop1 =(0.01*year_diff)*cust_df['Population (in Lakhs) (2011)+']
cust_df['Est-Pop-for-Given-Year']= cust_df['Population (in Lakhs) (2011)+']+pop1
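
The cell above applies simple (non-compounded) growth of 1% of the 2011 population per year, i.e. estimate = population_2011 × (1 + 0.01 × (year − 2011)). The helper below restates that arithmetic; the function name and its keyword defaults are illustrative, taken from the constants in the cell.

```python
def estimate_population(pop_2011_lakhs, year, annual_growth=0.01, base_year=2011):
    """Simple-growth estimate: the 2011 figure plus a fixed share of it per elapsed year."""
    return pop_2011_lakhs * (1 + annual_growth * (year - base_year))

# Ahmedabad: 63.5 lakhs in 2011, 14 years to 2025 -> 63.5 * 1.14
ahmedabad_2025 = estimate_population(63.5, 2025)
print(round(ahmedabad_2025, 2))
```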

**11) Calculating the Total number of crime cases for the given data**

In [96]:
## Calculating the Total number of crime cases for the given data
cases = cust_df['Est-Pop-for-Given-Year']*cust_df["Predicted_CRPI"]
cases1 = np.round(cases)
cust_df["Pred-Cases"] = cases1
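
The case count above is the estimated population multiplied by the predicted crime rate, rounded to a whole number. This reading assumes the crime rate is expressed per lakh of population, which the multiplication by a population in lakhs implies but the notebook does not state. A sketch with a hypothetical helper name:

```python
import numpy as np

def predicted_cases(est_pop_lakhs, crime_rate):
    """Cases = population (in lakhs) x rate (assumed per lakh), rounded as in the cell above."""
    return float(np.round(est_pop_lakhs * crime_rate))

# Ahmedabad, Murder: 72.39 lakhs x 1.527559 is about 110.58, which rounds to 111.
print(predicted_cases(72.39, 1.527559))
```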

In [97]:
result = []
for value in cust_df["Predicted_CRPI"]:
    if value <= 1:
        result.append("Very Low Crime Area")
    elif value <= 5:
        result.append("Low Crime Area")
    elif value <= 15:
        result.append("High Crime Area")    
    else:
        result.append("Very High Crime Area")
      
cust_df["Crime_Status"] = result   
cust_df.head()


Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Type,Predicted_CRPI,Est-Pop-for-Given-Year,Pred-Cases,Crime_Status
0,2025,Ahmedabad,63.5,Murder,1.527559,72.39,111.0,Low Crime Area
1,2025,Bengaluru,85.0,Murder,2.105882,96.9,204.0,Low Crime Area
2,2025,Chennai,87.0,Murder,1.850575,99.18,184.0,Low Crime Area
3,2025,Coimbatore,21.5,Murder,1.395349,24.51,34.0,Low Crime Area
4,2025,Delhi,163.1,Murder,2.783568,185.934,518.0,Low Crime Area
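
The loop above assigns a label by threshold: up to 1, up to 5, up to 15, and above 15. The same binning can be written with `pd.cut`, shown here as an equivalent sketch on hypothetical rate values rather than as a replacement for the cell.

```python
import numpy as np
import pandas as pd

# Hypothetical rates chosen to sit on and around each threshold.
rates = pd.Series([0.5, 1.0, 1.527559, 5.0, 15.0, 15.1])

# right=True makes each bin include its upper edge, matching the "<=" tests in the loop.
bins = [-np.inf, 1, 5, 15, np.inf]
labels = ["Very Low Crime Area", "Low Crime Area", "High Crime Area", "Very High Crime Area"]
status = pd.cut(rates, bins=bins, labels=labels, right=True).astype(str)
print(status.tolist())
```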


**12) Writing out the Customer data file with predicted values of Crime Rate, Total Number of cases, Population of the City for the year for which Prediction needs to be done**

In [98]:
finalDF= cust_df
print(finalDF.shape)
finalDF.head()

(190, 8)


Unnamed: 0,Year,City,Population (in Lakhs) (2011)+,Type,Predicted_CRPI,Est-Pop-for-Given-Year,Pred-Cases,Crime_Status
0,2025,Ahmedabad,63.5,Murder,1.527559,72.39,111.0,Low Crime Area
1,2025,Bengaluru,85.0,Murder,2.105882,96.9,204.0,Low Crime Area
2,2025,Chennai,87.0,Murder,1.850575,99.18,184.0,Low Crime Area
3,2025,Coimbatore,21.5,Murder,1.395349,24.51,34.0,Low Crime Area
4,2025,Delhi,163.1,Murder,2.783568,185.934,518.0,Low Crime Area


In [99]:
finalDF.to_csv("D:/CRPI-Latest1/Output-Files/0003B_Cust_Data_with_Pred_values.csv", index=False)

**13) Write the Trained ML model as a Pickle file into local folder so that it can be used by the Frontend programs**

In [100]:
import pickle
#now you can save it to a file
file = 'D:/CRPI-Latest1/Pickle-File/0003B_ML_Model_CRPI_DTC.pkl'

with open(file, 'wb') as f:
    pickle.dump(DTR, f)

**14) Verify that the Pickle file works by loading it and predicting on test data**

In [101]:
with open(file, 'rb') as f:
    k = pickle.load(f)

In [102]:
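# Feature order: Year, City code, Population (in Lakhs), Type code
# (0 = Ahmedabad and 7 = Economic Offences in the mapping tables above)
cy = k.predict([[2025, 0, 63.5, 7]])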
print(cy)

[12.04724409]
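
A stricter check than a single prediction is to confirm that the reloaded model returns the same values as the one still in memory. The sketch below does this on a hypothetical toy model, serialising in memory with `pickle.dumps` instead of writing to the project path, and passes the query as a DataFrame so the feature names match those used in training.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Same column names as the training features in this notebook; the rows are made up.
feature_names = ["Year", "City", "Population (in Lakhs) (2011)+", "Type"]
X_toy = pd.DataFrame(
    [[2014, 0, 63.5, 9], [2015, 0, 63.5, 9], [2014, 1, 85.0, 7]],
    columns=feature_names,
)
y_toy = [1.29, 1.48, 12.0]
model = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)

# Round trip through pickle in memory (no file is written).
restored = pickle.loads(pickle.dumps(model))

# The restored model should reproduce the original model's predictions exactly.
query = pd.DataFrame([[2025, 0, 63.5, 7]], columns=feature_names)
same = np.array_equal(model.predict(query), restored.predict(query))
print(same)
```

Pickle files are generally only safe to load with the same scikit-learn version that wrote them, which is worth keeping in mind for the frontend environment.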
