### Libraries and data ingestion

In [1]:
from common_libraries import * 
import project_function


In [2]:

spark = SparkSession.builder \
    .master("local[*]") \
    .config("spark.sql.debug.catalog", False) \
    .config("spark.logLevel", "ERROR") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "4") \
    .config("spark.driver.memory", "4g") \
    .config("spark.driver.cores", "2") \
    .getOrCreate()


spark.sparkContext.setLogLevel("ERROR")



### Data Preprocessing for Binary Classification

In [3]:
curated_df = spark.read.parquet("curated_data.parquet")

                                                                                

#### Binary classification is a supervised machine learning problem in which the goal is to predict a categorical target that takes one of two classes. In this project, the task is to identify whether a crime is a robbery or not. Models such as logistic regression, decision trees, and support vector machines are typically used to assign an input to one of the two classes, often by estimating the probability that it belongs to the positive class (denoted 1) rather than the negative class (denoted 0). Performance is commonly evaluated with accuracy, precision, recall, F1 score, and ROC-AUC, each of which captures a different aspect of the model's ability to distinguish the two classes.
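
To make the metrics named above concrete, here is a minimal sketch that derives them from the four cells of a confusion matrix. The function name `binary_metrics` and the counts are hypothetical illustrations, not values from this dataset.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical confusion counts (assumed for illustration only).
metrics = binary_metrics(tp=30, fp=10, fn=20, tn=940)
print(metrics)
```

With these assumed counts, accuracy is high (0.97) while recall for the positive class is only 0.6, which is why accuracy alone can be misleading when one class is rare.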

In [4]:
curated_df = curated_df.withColumn('robbery_crime_type', when(col('crm_cd_desc') == 'ROBBERY', 1).otherwise(0))

In [5]:
curated_df = curated_df.drop('crm_cd_desc','crm_cd')

In [6]:
#curated_df.select(col('crm_cd_desc')).distinct().show(40,truncate=False)
#curated_df.groupBy('crm_cd_desc').count().orderBy(col('count').desc()).show(truncate=False)

#### Assessing the balance of the class instances (one class is significantly more prevalent than the other)

In [7]:
#curated_df.toPandas().stalking_crime_type.value_counts(normalize=True)

In [8]:
robbery_type_counts = curated_df.groupBy('robbery_crime_type').count()

# Calculate the proportion of each crime type
proportions = robbery_type_counts.withColumn('proportion', F.col('count') / curated_df.count())

proportions.show()

+------------------+------+--------------------+
|robbery_crime_type| count|          proportion|
+------------------+------+--------------------+
|                 1| 31521|0.034050252776217434|
|                 0|894199|  0.9659497472237826|
+------------------+------+--------------------+
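
The proportions above also fix a useful baseline. The sketch below, in plain Python with the counts copied from the table, computes the accuracy that a trivial classifier would reach by always predicting the majority class, along with the imbalance ratio. The variable names are illustrative.

```python
# Counts copied from the proportions table for robbery_crime_type.
counts = {1: 31521, 0: 894199}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class is right
# exactly as often as that class occurs.
majority_label = max(counts, key=counts.get)
baseline_accuracy = proportions[majority_label]

# Number of non-robbery records per robbery record.
imbalance_ratio = counts[0] / counts[1]

print(f"majority class: {majority_label}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"imbalance ratio: {imbalance_ratio:.1f} to 1")
```

Any model trained on this data needs to be judged against this baseline rather than against 50%.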



#### Assessing categorical feature vict_sex

In [9]:

curated_df.select('vict_sex').distinct().show()
#count().orderBy('count', ascending=False).first()['vict_sex']
#df.groupBy('vict_descent').count().orderBy('count', ascending=False).first()['vict_descent']


+--------+
|vict_sex|
+--------+
|       F|
| unknown|
|       M|
|       X|
|       H|
+--------+



#### Convert categorical features with one-hot encoding 

In [10]:

# StringIndexer converts the 'vict_sex' column into numerical indices
stringIndexer = StringIndexer(inputCol="vict_sex", outputCol="vict_sex_type_indexed")

# OneHotEncoder encodes the numerical indices into one-hot vectors
encoder = OneHotEncoder(inputCol="vict_sex_type_indexed", outputCol="vict_sex_type_encoded")

# VectorAssembler merges the feature columns into a single 'features' vector.
# Note: it takes 'vict_sex_type_indexed', not the one-hot 'vict_sex_type_encoded' column.
assembler = VectorAssembler(inputCols=['vict_age', 'crime_day_occ', 'crime_month_occ', 'crime_year_occ',
                                       'crime_day_rptd', 'crime_month_rptd', 'crime_year_rptd',
                                       'vict_sex_type_indexed'], outputCol='features')

# Define a pipeline that chains StringIndexer, OneHotEncoder, and VectorAssembler
pipeline = Pipeline(stages=[stringIndexer, encoder, assembler])

# Fit the pipeline to the DataFrame and transform the DataFrame
curated_df_encoded = pipeline.fit(curated_df).transform(curated_df)
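
As an illustration of what the indexing and encoding stages do conceptually, here is a plain-Python sketch. It assumes the behaviour I understand to be Spark's default: indices are assigned by descending frequency, and the one-hot vector drops the last category. The helper names and sample values are hypothetical, and ties are broken alphabetically in this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to an index, most frequent category first."""
    freq = Counter(values)
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    return {value: index for index, value in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Encode an index as a 0/1 vector; the last category maps to all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


sample = ["M", "F", "M", "X", "F", "M"]  # hypothetical vict_sex values
mapping = fit_string_index(sample)
encoded = [one_hot(mapping[v], len(mapping)) for v in sample]

print(mapping)
print(encoded)
```

The index column imposes an artificial ordering on the categories, whereas the one-hot vector does not, which is the usual reason to prefer the encoded column for a linear model.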



In [11]:
curated_df_encoded = curated_df_encoded.drop('vict_sex')


#### Subsetting the dataset to the features of interest for later use with SMOTE

In [12]:
data_model = curated_df_encoded.select('vict_age','crime_day_occ', 
                                 'crime_month_occ', 'crime_year_occ', 'crime_day_rptd', 
                                 'crime_month_rptd', 'crime_year_rptd', 
                                 'robbery_crime_type','vict_sex_type_indexed')

#### Fitting a model without scaling the data and ignoring class imbalance

In [13]:
# Split the data into training and testing sets
train_data, test_data = curated_df_encoded.randomSplit([0.8, 0.2], seed=42)

# Define the logistic regression model
lr = LogisticRegression(featuresCol='features', labelCol='robbery_crime_type')

# The features were already indexed and assembled above,
# so this pipeline contains only the logistic regression stage
pipeline = Pipeline(stages=[lr])

# Fit the pipeline on the training set
model = pipeline.fit(train_data)

# Generate predictions for the test set
predictions = model.transform(test_data)

                                                                                

In [14]:
predictions.show(1,vertical=True)

-RECORD 0-------------------------------------
 crime_record_id       | 200100001            
 area                  | 1                    
 area_name             | Central              
 rpt_dist_no           | 111                  
 mocodes               | 0344                 
 vict_age              | 0                    
 vict_descent          | H                    
 premis_cd             | 108                  
 premis_desc           | PARKING LOT          
 status                | AA                   
 status_desc           | Adult Arrest         
 crm_cd_1              | 510                  
 location              | 500 N  FIGUEROA  ... 
 lat                   | 34.0617              
 lon                   | -118.2469            
 column_days_to_report | 1                    
 crime_day_occ         | 25                   
 crime_month_occ       | 1                    
 crime_year_occ        | 2020                 
 crime_day_rptd        | 26                   
 crime_month_

                                                                                

In [15]:
# Show some predictions
predictions.select("features", 'robbery_crime_type', "prediction", "probability").show(10)

# Evaluate the model using the BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(labelCol="robbery_crime_type", rawPredictionCol="rawPrediction", metricName="areaUnderROC")

# Compute the area under the ROC curve
auc = evaluator.evaluate(predictions)
print("Area under ROC: {:.4f}".format(auc))


+--------------------+------------------+----------+--------------------+
|            features|robbery_crime_type|prediction|         probability|
+--------------------+------------------+----------+--------------------+
|[0.0,25.0,1.0,202...|                 0|       0.0|[0.96371154371383...|
|[31.0,14.0,1.0,20...|                 0|       0.0|[0.96565364481987...|
|[0.0,4.0,1.0,2020...|                 0|       0.0|[0.96815823170409...|
|[34.0,19.0,1.0,20...|                 0|       0.0|[0.96173412740292...|
|[0.0,24.0,1.0,202...|                 0|       0.0|[0.95796554058383...|
|[49.0,19.0,1.0,20...|                 0|       0.0|[0.96443570711348...|
|[39.0,11.0,1.0,20...|                 0|       0.0|[0.96620712679914...|
|[0.0,11.0,1.0,202...|                 0|       0.0|[0.96130481830205...|
|[23.0,2.0,1.0,202...|                 0|       0.0|[0.96841640623762...|
|[24.0,3.0,1.0,202...|                 0|       0.0|[0.96807728051289...|
+--------------------+----------------

                                                                                

Area under ROC: 0.5358


In [16]:
# Initialize the evaluator object for classification metrics
evaluator = MulticlassClassificationEvaluator(labelCol='robbery_crime_type', predictionCol='prediction')

# Evaluate weighted precision
evaluator.setMetricName("weightedPrecision")
precision = evaluator.evaluate(predictions)
print("Precision:", precision)

# Evaluate weighted recall
evaluator.setMetricName("weightedRecall")
recall = evaluator.evaluate(predictions)
print("Recall:", recall)

# Evaluate F1 Score
evaluator.setMetricName("f1")
f1Score = evaluator.evaluate(predictions)
print("F1 Score:", f1Score)


                                                                                

Precision: 0.9329527042563608


                                                                                

Recall: 0.9658947687281264



F1 Score: 0.9491379895781019


                                                                                

In [13]:
#categorical_features = [curated_df.dtypes[value][0] for value in range(0,len(curated_df.columns)) if curated_df.dtypes[value][1]=='string']

In [14]:
#numerical_features = [curated_df.dtypes[value][0] for value in range(0,len(curated_df.columns)) if curated_df.dtypes[value][1]=='int' or curated_df.dtypes[value][1]=='double']


#### The high precision, recall, and F1 score contrast with the low area under the ROC curve (AUC), and the most likely cause is class imbalance. The three scores are support-weighted averages over both classes, so the negative class, which makes up about 96.6% of the data, dominates them. The weighted recall (0.9659) is almost exactly the share of the negative class, and the weighted precision (0.9330) equals that share squared, which is the pattern expected when every test row is predicted as negative. The AUC, in contrast, measures how well the model ranks positive instances above negative ones across all thresholds, and a value of 0.5358 is barely better than random ordering. The model therefore appears to have learned little beyond the class prior: it looks accurate only because non-robbery crimes are so prevalent, and it fails to identify the rare positive instances.
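
One way to check this reading is to compute the weighted metrics that a classifier would obtain if it labelled every test row as negative, and compare them with the reported values. The sketch below does this in plain Python. It assumes that the negative share of the test set equals the reported weighted recall, and that a class which is never predicted gets precision, recall, and F1 of 0; the function name is hypothetical.

```python
# Weighted metrics printed by the notebook for the model trained on unbalanced data.
reported = {
    "precision": 0.9329527042563608,
    "recall": 0.9658947687281264,
    "f1": 0.9491379895781019,
}


def all_negative_weighted_metrics(neg_share):
    """Support-weighted metrics of a classifier that predicts 0 for every row."""
    pos_share = 1.0 - neg_share
    # Class 0: every negative is recovered; precision equals the negative share.
    p0, r0 = neg_share, 1.0
    f0 = 2 * p0 * r0 / (p0 + r0)
    # Class 1: never predicted, so all three scores are taken as 0.
    p1 = r1 = f1 = 0.0
    return {
        "precision": neg_share * p0 + pos_share * p1,
        "recall": neg_share * r0 + pos_share * r1,
        "f1": neg_share * f0 + pos_share * f1,
    }


# Assumption: the test set's negative share equals the reported weighted recall.
expected = all_negative_weighted_metrics(reported["recall"])
for name in reported:
    print(f"{name}: reported={reported[name]:.6f} all-negative={expected[name]:.6f}")
```

The two sets of numbers agree closely, which supports the interpretation that the model rarely, if ever, predicts the robbery class on the test set.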

### Balance Class Distribution using SMOTE.

#### Class imbalance occurs when one class in a classification problem significantly outnumbers the other, and it is common in many machine learning problems. SMOTE (Synthetic Minority Over-sampling Technique) balances the class distribution by generating synthetic samples of the minority class. It creates new instances that are similar to existing minority class instances by interpolating between a minority instance and one of its nearest minority neighbours. This addresses the imbalance in the dataset and can improve the performance of machine learning models, especially those sensitive to class imbalance.
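
The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration, not the imbalanced-learn implementation: it always uses the single nearest minority neighbour, whereas SMOTE picks one of several nearest neighbours at random. The function name and the sample points are hypothetical.

```python
import random


def smote_like_sample(minority, rng):
    """Create one synthetic point on the segment between a minority point
    and its nearest minority neighbour."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Nearest neighbour by squared Euclidean distance.
    neighbour = min(others,
                    key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


rng = random.Random(42)
minority_points = [[1.0, 2.0], [1.5, 2.5], [3.0, 1.0]]  # hypothetical minority samples
synthetic = [smote_like_sample(minority_points, rng) for _ in range(5)]

for point in synthetic:
    print(point)
```

Because every synthetic point lies between two existing minority points, SMOTE fills in the minority region rather than duplicating rows, but it cannot add information that the features do not already carry.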

In [19]:
# Convert to a Pandas DataFrame. Note: data_model holds the full dataset, not only a
# training split, so the train/test split made later also contains synthetic rows.
train_data_pd = data_model.toPandas()

# Separate the features from the target variable
X = train_data_pd.drop('robbery_crime_type', axis=1)
y = train_data_pd['robbery_crime_type']

# Apply SMOTE oversampling to the minority class
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Combine resampled features and target variable
resampled_df = pd.DataFrame(X_resampled, columns=X.columns)
resampled_df['robbery_crime_type'] = y_resampled

# Convert back to a PySpark DataFrame
train_data_balanced = spark.createDataFrame(resampled_df).repartition(8)


#### Check the new proportions

In [20]:
robbery_type_counts = train_data_balanced.groupBy('robbery_crime_type').count()

# Calculate the proportion of each crime type
proportions = robbery_type_counts.withColumn('proportion', F.col('count') / resampled_df.shape[0])

proportions.show()



+------------------+------+----------+
|robbery_crime_type| count|proportion|
+------------------+------+----------+
|                 0|894199|       0.5|
|                 1|894199|       0.5|
+------------------+------+----------+



                                                                                

#### Creating a VectorAssembler, which merges multiple columns into a single vector column. It is commonly used to assemble feature vectors for machine learning models in PySpark.

In [21]:
# Define the feature vector assembler
assembler = VectorAssembler(inputCols=['vict_age', 'crime_day_occ', 'crime_month_occ', 'crime_year_occ',
                                       'crime_day_rptd', 'crime_month_rptd', 'crime_year_rptd',
                                       'vict_sex_type_indexed'], outputCol='features')

# Assemble features
data_assembled = assembler.transform(train_data_balanced)

# Split the data into training and testing sets
train_data, test_data = data_assembled.randomSplit([0.8, 0.2], seed=42)

# Define the feature scaler (it is fitted later, as a pipeline stage)
scaler = StandardScaler(inputCol='features', outputCol='scaled_features')
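
For intuition about the scaling step, the sketch below reproduces in plain Python what I understand to be the default behaviour of Spark's `StandardScaler`: each feature is divided by its sample standard deviation and is not centred on the mean. The function name and the age values are hypothetical.

```python
import statistics


def scale_to_unit_std(column):
    """Divide a feature by its sample standard deviation without centring it."""
    std = statistics.stdev(column)
    return [x / std if std else 0.0 for x in column]


ages = [0.0, 23.0, 31.0, 34.0, 49.0]  # hypothetical vict_age values
scaled_ages = scale_to_unit_std(ages)

print(scaled_ages)
print("std after scaling:", statistics.stdev(scaled_ages))
print("mean after scaling:", statistics.mean(scaled_ages))
```

After scaling, the feature has unit standard deviation but a non-zero mean, so features measured on very different ranges, such as age and year, contribute on a comparable scale.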



#### Now that the data preprocessing is complete, we fit a logistic regression model to classify crimes as either robbery or non-robbery based on the selected features, evaluate its performance, and print the AUC-ROC of its predictions.

In [22]:
# Define Logistic Regression model
lr = LogisticRegression(featuresCol='scaled_features', labelCol='robbery_crime_type')

# Create pipeline
pipeline = Pipeline(stages=[scaler, lr])

# Fit the pipeline
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol='robbery_crime_type')
auc_roc= evaluator.evaluate(predictions)

print("AUC-ROC:", auc_roc)

                                                                                

AUC-ROC: 0.5526480480792335


In [23]:
# Initialize the evaluator object for classification metrics
evaluator = MulticlassClassificationEvaluator(
    labelCol='robbery_crime_type', 
    predictionCol='prediction',
    metricName='f1'  # Start with F1 Score
)

# Calculate F1 Score
f1Score = evaluator.evaluate(predictions)
print("F1 Score:", f1Score)

# Calculate Precision
evaluator.setMetricName("weightedPrecision")
precision = evaluator.evaluate(predictions)
print("Precision:", precision)

# Calculate Recall
evaluator.setMetricName("weightedRecall")
recall = evaluator.evaluate(predictions)
print("Recall:", recall)


                                                                                

F1 Score: 0.5391892170113776


                                                                                

Precision: 0.5427016961471699



Recall: 0.5417315656572718


                                                                                

### Let's interpret each of these metrics:

#### F1 Score: 0.5392
- Definition: The F1 score is the harmonic mean of precision and recall. It is a measure of a test's accuracy that considers both the precision and the recall.
- Interpretation: An F1 score of approximately 0.5392 is low. The evaluator's `f1` metric is the support-weighted average of the per-class F1 scores, and on data with two roughly equal classes a value this close to 0.5 is only slightly better than what random guessing would achieve.


#### Precision: 0.5427
- Definition: Precision is the ratio of correctly predicted positive observations to the total predicted positives. It answers the question, "Of all the instances the model labeled as positive, how many were actually positive?"
- Interpretation: The reported 0.5427 is `weightedPrecision`, the support-weighted average of the per-class precisions, not the precision of the positive class alone. It indicates that, averaged over both classes, about 54% of the predictions made for a class were correct, so a large share of the predictions for each class are false positives.

#### Recall: 0.5417
- Definition: Recall, or sensitivity, is the ratio of correctly predicted positive observations to all observations in the actual class. It measures the model's ability to find all the relevant cases (positive instances).
- Interpretation: The reported 0.5417 is `weightedRecall`, the support-weighted average of the per-class recalls, which equals overall accuracy. The model therefore classifies about 54.17% of the test rows correctly and misclassifies nearly half of them.

#### AUC ROC: 0.5526
- Definition: The Area Under the Receiver Operating Characteristic Curve (AUC ROC) summarizes classification performance across all threshold settings. It indicates how well the model distinguishes between the classes: the higher the AUC, the better the model is at ranking 1s above 0s.
- Interpretation: An AUC ROC of 0.5526 is only slightly better than a random guess (0.5). This indicates that the model has a limited ability to discriminate between the positive and negative classes.

### Overall Model Performance
The metrics indicate that the model performs only slightly better than random guessing, as the AUC ROC value shows most clearly. The F1 score, precision, and recall, all close to 0.54, confirm that the model has little predictive power and is not reliable enough to support decisions in its current state. Note that these scores were computed on a test split drawn from the SMOTE-resampled data, so they are not directly comparable with the scores of the first model.

Resampling approaches that could be compared include the following:
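
The AUC also has a direct probabilistic reading: it is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing every positive/negative pair, counting ties as one half. The function name, labels, and scores are hypothetical.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0, 0]               # hypothetical labels
perfect_scores = [0.9, 0.8, 0.3, 0.2, 0.1]
mixed_scores = [0.9, 0.2, 0.3, 0.1, 0.8]
constant_scores = [0.5] * 5            # no ranking information at all

print(pairwise_auc(labels, perfect_scores))
print(pairwise_auc(labels, mixed_scores))
print(pairwise_auc(labels, constant_scores))
```

A model whose scores carry no ranking information lands at 0.5, which is why a value of 0.55 reads as close to chance.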

- Random oversampling
- Random undersampling
- Oversampling with SMOTE
- Oversampling with ADASYN
- Undersampling with Tomek links
- Oversampling with SMOTE, then undersampling with Tomek links (SMOTE-Tomek)
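
As an example of the simplest option in the list, the sketch below implements random undersampling in plain Python: every class is cut down to the size of the smallest class by sampling without replacement. The function name and the toy rows are hypothetical, and in practice the resampling would be applied to the training split only.

```python
import random


def random_undersample(rows, label_key, rng):
    """Downsample every class to the size of the smallest class."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))  # sampling without replacement
    rng.shuffle(balanced)
    return balanced


# Hypothetical training rows: 10 positives and 90 negatives.
rows = [{"vict_age": age, "robbery_crime_type": 1 if age % 10 == 0 else 0}
        for age in range(1, 101)]

balanced_rows = random_undersample(rows, "robbery_crime_type", random.Random(42))
n_positive = sum(r["robbery_crime_type"] for r in balanced_rows)
print(len(balanced_rows), n_positive)
```

Undersampling discards majority rows rather than creating synthetic ones, which keeps the data real at the cost of a smaller training set.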
  
