This notebook preprocesses the data so that it is ready to be clustered. The steps are exploring the data, getting rid of outliers, 
and converting every feature to binary so that only its presence or absence is recorded. For feature reduction I chose a variance threshold, 
removing only the features whose values are identical across all samples. If a feature is exactly the same (present or not present) for every sample, 
it contributes nothing that could help tell samples apart. I chose not to use Principal Component Analysis (PCA) because it is designed for continuous data 
rather than binary data, so its assumptions may not hold here. Multiple Correspondence Analysis (MCA) would also work for dimensionality reduction, 
but it would likely cause a loss of information.
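
To make the zero-variance argument concrete: a binary feature that is present in a fraction p of the samples has variance p(1 - p), which is zero only when p is 0 or 1. The sketch below uses made-up arrays (they are illustrative and not taken from the dataset), and `binary_variance` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical presence/absence columns, for illustration only.
always_present = np.array([1, 1, 1, 1])
sometimes_present = np.array([1, 0, 0, 1])

def binary_variance(values):
    """Population variance of a 0/1 column, computed as p * (1 - p)."""
    p = values.mean()  # share of samples where the feature is present
    return p * (1 - p)

print(binary_variance(always_present))     # constant column
print(binary_variance(sometimes_present))  # column that varies between samples
```

A threshold of 0 therefore removes only the columns that are all 0s or all 1s and keeps everything else.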

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold
import re

In [3]:
df = pd.read_csv("../data/parquets/raw_parquet.csv")


In [14]:
# Use this to see the value counts of each column. The output of the built-in value_counts function in the pandas library is hard to read across many columns, so val_count numbers each column and separates the results with a divider.

def val_count(data):
    for position, col in enumerate(data.columns):
        print(position)
        print(data[col].value_counts())
        print("=========================="*5)
    print(data.shape)

In [None]:
val_count(df)
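
With several hundred columns, the printed value counts get long. As a possible complement, the sketch below builds a one-row-per-column summary instead. `column_summary` and the toy DataFrame are hypothetical names and data used only for illustration; they are not part of the original notebook.

```python
import pandas as pd

def column_summary(data):
    """Return one row per column: distinct non-missing values and share of missing entries."""
    return pd.DataFrame({
        "n_unique": data.nunique(),          # NaN is not counted as a value
        "missing_share": data.isna().mean(), # fraction of rows that are missing
    })

# Toy data, for illustration only.
toy = pd.DataFrame({"a": ["x", "y", None], "b": [1, 1, 1]})
summary = column_summary(toy)
print(summary)
```

Sorting the result by `n_unique` or `missing_share` is one way to spot constant or mostly empty columns before looking at their full value counts.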

Some features may be unnecessary for the analysis, especially those that relate to how the app is used rather than to the data itself, such as line and circle colors. It is also important to reduce the number of features where possible, both to reduce noise later on and to improve processing time.

In [18]:
# Drop display (symbology), identifier, and timestamp columns.
# One combined pattern means a column that matches more than one rule is listed only once.
drop_pattern = re.compile(r"^properties_symbology|_id$|_timestamp$")
columns_to_drop = [col for col in df.columns if drop_pattern.search(col)]

df = df.drop(columns= columns_to_drop)

print("Columns Dropped:")
for col in columns_to_drop:
    print(f"\t{col}")



Columns Dropped:
	properties_viewed_timestamp
	properties_modified_timestamp
	properties_symbology_circleColor
	properties_id
	properties_symbology_lineColor
	properties_symbology_lineWidth
	properties_symbology_lineDasharray
	properties_sed_strat_section_strat_section_id
	properties_strat_section_id
	properties_symbology_fillColor
	properties_orientation_id
	properties_custom_fields_osm_id
	properties_orientation_modified_timestamp
	properties_custom_fields_id
	properties_orientation_unix_timestamp
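
To audit why a given column was dropped, it can help to see which rule matched it. The sketch below is one way to do that. The three rules mirror the patterns used in the cell above; `RULES`, `matched_rules`, and the last two column names are hypothetical and exist only for illustration.

```python
import re

# The three drop rules from the cell above, labeled for reporting.
RULES = {
    "symbology": re.compile(r"^properties_symbology"),
    "id": re.compile(r"_id$"),
    "timestamp": re.compile(r"_timestamp$"),
}

def matched_rules(name):
    """Return the labels of every rule that matches a column name."""
    return [label for label, pattern in RULES.items() if pattern.search(name)]

names = [
    "properties_symbology_lineColor",  # appears in the output above
    "properties_orientation_id",       # appears in the output above
    "properties_symbology_layer_id",   # hypothetical: matches two rules
    "properties_orientation_strike",   # hypothetical: matches no rule
]
for name in names:
    print(name, matched_rules(name))
```

A name can match more than one rule, which is worth keeping in mind when each rule is checked with its own `if`.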


In [20]:
#getting rid of duplicates

df.drop_duplicates(inplace = True)
df.shape

(1359415, 465)
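
Before dropping duplicates, it can be useful to count how many rows are exact copies of an earlier row. A minimal sketch on toy data (the DataFrame below is made up for illustration):

```python
import pandas as pd

# Toy data: the second row repeats the first.
toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print(n_duplicates)
print(deduplicated.shape)
```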

In [21]:
def binary_simplification(df):
    """Convert a pandas DataFrame to binary presence/absence indicators.

    A cell becomes 1 if it holds a value and 0 if it is missing (NaN) or an empty string.

    Args:
        df (pandas.DataFrame): a pandas dataframe of the data that needs to be converted to binary

    Returns:
        pandas.DataFrame: a new dataframe with the same columns and index, holding only 0s and 1s
    """
    # Treat empty strings as missing, then flag every remaining value as present.
    return df.replace('', np.nan).notna().astype(int)

In [23]:
df = binary_simplification(df)
# val_count(df)
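
To show what the conversion does to individual cells, here is a small sketch on toy data. The column names and values are hypothetical, and the one-line expression is assumed to be equivalent to what `binary_simplification` computes: empty strings and missing values become 0, and everything else becomes 1.

```python
import numpy as np
import pandas as pd

# Toy data with one text column and one numeric column (hypothetical names).
toy = pd.DataFrame({
    "rock_type": ["granite", "", None],
    "strike": [120.0, np.nan, 45.0],
})

# Empty strings are treated as missing; any remaining value counts as present.
presence = toy.replace("", np.nan).notna().astype(int)
print(presence)
```

Note that the actual values are discarded: after this step only the presence or absence of each feature is left for clustering.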

In [24]:
# Use a variance threshold for feature reduction: remove features whose values are all identical.

# threshold=0 drops only zero-variance features. For a binary feature present in a
# fraction p of the samples, the variance is p * (1 - p), which is 0 only when p is 0 or 1.
selector = VarianceThreshold(threshold=0)
selector.fit(df)  # only the fitted mask is needed, not the transformed array

cols_idxs = selector.get_support(indices=True)
df = df.iloc[:,cols_idxs]

print(df.shape)
# val_count(df)


(1359415, 442)
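
The shape goes from 465 to 442 columns, so 23 features were constant across all samples. To see which columns a zero threshold removes, the boolean mask from `get_support()` can be applied to the column names. The sketch below does this on toy data; the column names are made up for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy binary data: two constant columns and one that varies.
toy = pd.DataFrame({
    "always_present": [1, 1, 1, 1],
    "never_present": [0, 0, 0, 0],
    "sometimes_present": [1, 0, 0, 1],
})

selector = VarianceThreshold(threshold=0)
selector.fit(toy)

kept_mask = selector.get_support()           # boolean array, True for kept columns
removed = toy.columns[~kept_mask].tolist()   # names of the constant columns
reduced = toy.loc[:, kept_mask]              # DataFrame with column names preserved

print(removed)
print(reduced.columns.tolist())
```

Logging the removed names is a cheap way to confirm that nothing meaningful was lost in this step.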


In [25]:
df.to_csv("../data/parquets/processed_parquet.csv", index = False)
