<a href="https://colab.research.google.com/github/panaku88/MCS-7103-Machine-Learning/blob/main/EDA_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Here I import the drive module from google.colab and mount Google Drive at /content/drive so that the notebook can read and write files stored there

In [100]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


2. Importing the Python libraries discussed in the report write-up

In [101]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import re
import random

3. Reading the raw customer dataset into a pandas DataFrame and specifying the path (output_path) where the modified dataset will be stored

In [103]:
raw_dataset = pd.read_csv('/content/drive/MyDrive/MCSC1/dataset/CS_Service_Data.csv')

# Specify the path to save the modified/manipulated dataset
output_path = '/content/drive/MyDrive/MCSC1/dataset/customer_support_dataset.csv'


4. In the next code cells, I assess the dataset to understand its structure and content and to check whether it has any problems

In [None]:
raw_dataset.head()

In [None]:
raw_dataset.info()

In [None]:
raw_dataset.describe()

In [None]:
raw_dataset.isnull().sum()

In [None]:
raw_dataset.shape

In [None]:
raw_dataset.columns
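
The checks above can be combined into a single summary of missing values per column. This is only a sketch on an invented toy frame: the name missing_summary and all the values are assumptions, and only the column names are borrowed from this notebook.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "TICKET NUMBER": ["TT001", "TT002", "TT003", "TT004"],
    "TICKET OWNER": ["Agent A", None, "Agent B", None],
    "SERVICE PLAN": ["Plan X", "Plan Y", None, "Plan X"],
})

# Count missing values per column and express them as a share of all rows.
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
})
print(missing_summary)
```

Running the same two expressions on raw_dataset would show which columns are affected most by missing values.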

5. The following custom Python code manipulates the dataset to remove sensitive information: it replaces customer names with generated ones and defines helper functions for rewriting ticket IDs and service plan names
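
A minimal sketch of the idea behind the name replacement in the cell below: a dictionary keyed by account number guarantees that one account always receives the same replacement name. The function pseudonym_for, the Customer-NNNN format and the account numbers are hypothetical and deliberately simpler than the syllable-based names generated below.

```python
import random

# Hypothetical lookup table: one pseudonym per account number.
pseudonyms = {}

def pseudonym_for(account, rng):
    """Return the stored pseudonym for an account, creating one on first use."""
    if account not in pseudonyms:
        pseudonyms[account] = "Customer-" + str(rng.randint(1000, 9999))
    return pseudonyms[account]

rng = random.Random(42)  # seeded only so that this sketch is repeatable
accounts = ["ACC-1", "ACC-2", "ACC-1"]  # invented account numbers
names = [pseudonym_for(acc, rng) for acc in accounts]
print(names)
```

The first and third entries are identical because both belong to ACC-1, which is the property that keeps a customer's tickets linked after the real name is removed.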

In [105]:
# Cache of generated names keyed by customer account, so each account keeps one replacement name
customer_names = {}
syllables = ['a', 'e', 'i', 'o', 'u', 'ka', 'ko', 'sa', 'tu', 'ma', 'me', 'mi', 'mo', 'mu', 'ya', 'ye', 'yi', 'yo', 'yu', 'ra', 're', 'ri', 'ro', 'ru', 'wa', 'we', 'wi', 'wo', 'wu']

def generate_name(min_length=3, max_length=6):
  # Build a made-up name from a random number of syllables
  name = ''
  length = random.randint(min_length, max_length)
  for i in range(length):
    name += random.choice(syllables)
  return name.capitalize()

def generate_customer_name(row):
  # Replace the real customer name with a generated one
  name = row['CUSTOMER NAME']
  account = row['CUSTOMER ACCOUNT']
  if isinstance(name, str):
    if account in customer_names:
      # Reuse the name already generated for this account
      return customer_names[account]
    else:
      if 'Mr' in name or 'Ms' in name:
        # Individual with a title: keep a title, but pick it at random
        title = random.choice(['Mr', 'Ms'])
        first_name = generate_name()
        last_name = generate_name()
        new_name = f'{title} {first_name} {last_name}'
      elif 'Company' in name or 'Ltd' in name or 'Inc' in name:
        # Business customer: keep a company-style suffix
        new_name = generate_name() + ' Inc'
      else:
        # Anything else: generate a plain first and last name
        first_name = generate_name()
        last_name = generate_name()
        new_name = f'{first_name} {last_name}'
      customer_names[account] = new_name
      return new_name
  else:
    # Leave missing (non-string) values untouched
    return name

raw_dataset['CUSTOMER NAME'] = raw_dataset.apply(generate_customer_name, axis=1)


# Define a function to replace the sensitive incident ID prefix
def replace_rke_with_tt(text):
    # Check if the value is a string
    if isinstance(text, str):
        # Replace any word starting with 'RKE' with 'TT' but keep the rest of the word unchanged
        return re.sub(r'\bRKE(\w*)\b', r'TT\1', text)
    else:
        # Return the original value if it's not a string
        return text


def replace_service_plan(text):
  if isinstance(text, str):
    if re.match(r'CAPPED-BASE: Roke Capped Base', text):
      return 'SONIC HOME PRO 25Mbps'
    else:
      return text
  else:
    return text

def replace_service_plan_ent(text):
  if isinstance(text, str):
    match = re.search(r'RE(\d+): Roke Enterprise', text)
    if match:
      number = match.group(1)
      return f'SONIC BUSINESS {number}Mbps'
    else:
      return text
  else:
    return text


def replace_service_plan_vpn(text):
  if isinstance(text, str):
    if 'VPN' in text or 'vpn' in text:
      random_number = random.randint(1, 100)
      return f'SONIC MPLS VPN {random_number}Mbps'
    else:
      return text
  else:
    return text
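
To illustrate how the regular expressions in the functions above behave, here is a short sketch on invented strings. The sample note and plan values are assumptions, since the real ticket and plan strings in the dataset may be formatted differently.

```python
import re

# Invented sample values, used only to show what the patterns match.
note = "Escalated from RKE12345, see also RKE-9 and TRKE77"
plan = "RE20: Roke Enterprise"

# \bRKE(\w*)\b matches words that start with RKE and keeps the rest of the word in group 1.
cleaned_note = re.sub(r"\bRKE(\w*)\b", r"TT\1", note)
print(cleaned_note)

# The digits captured after "RE" become the bandwidth figure in the new plan name.
match = re.search(r"RE(\d+): Roke Enterprise", plan)
new_plan = f"SONIC BUSINESS {match.group(1)}Mbps" if match else plan
print(new_plan)
```

The word boundary \b is why TRKE77 is left alone: RKE is only replaced when it starts a word.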


6. Data cleaning: After understanding the structure and content in step 4, I realized that the data needed cleaning, because it had missing values and some attributes that were not relevant to my purpose.

In [106]:
# Anonymize the service plans and ticket numbers with the functions defined above,
# before anything is written to disk
raw_dataset['SERVICE PLAN'] = raw_dataset['SERVICE PLAN'].apply(replace_service_plan_vpn)
raw_dataset['SERVICE PLAN'] = raw_dataset['SERVICE PLAN'].apply(replace_service_plan_ent)
raw_dataset['SERVICE PLAN'] = raw_dataset['SERVICE PLAN'].apply(replace_service_plan)
raw_dataset['TICKET NUMBER'] = raw_dataset['TICKET NUMBER'].str.replace(r'^RKE', 'TT', regex=True)

# Save the anonymized dataset to the specified path
raw_dataset.to_csv(output_path, index=False)

print(f"The clean_cs_dataset.csv has been created successfully at {output_path}.")
new_dataset = pd.read_csv(output_path)

# Drop rows in which any cell contains 'Not Specified', 'Not Selected' or 'Shared Bandwidth'
new_dataset = new_dataset[new_dataset.applymap(lambda x: 'Not Specified' not in
                                               str(x) and 'Not Selected' not in str(x) and 'Shared Bandwidth' not in str(x)).all(axis=1)]

# Drop rows without a ticket owner, then drop the unnecessary attributes
new_dataset = new_dataset.dropna(subset=['TICKET OWNER'])
new_dataset = new_dataset.drop(columns=['CURRENT STATUS', 'ALLOCATED TIME (HOURS)'])

# Save the cleaned dataset to the specified path
new_dataset.to_csv(output_path, index=False)

The clean_cs_dataset.csv has been created successfully at /content/drive/MyDrive/MCSC1/dataset/customer_support_dataset.csv.
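
The placeholder filter in the cell above can also be written with column-wise string matching, which avoids applymap (deprecated in recent pandas releases). This is a sketch on an invented toy frame, not the code that produced the saved file. Only the three placeholder strings come from the notebook; the rows and the names flagged and filtered are assumptions.

```python
import re
import pandas as pd

# Invented rows; only the placeholder strings are taken from the cell above.
toy = pd.DataFrame({
    "TICKET OWNER": ["Agent A", "Not Specified", "Agent B", "Agent C"],
    "SERVICE PLAN": ["Plan X", "Plan Y", "Shared Bandwidth 10Mbps", "Plan Z"],
})

placeholders = ["Not Specified", "Not Selected", "Shared Bandwidth"]
pattern = "|".join(re.escape(p) for p in placeholders)

# Flag every cell whose text contains a placeholder, then keep rows with no flagged cell.
flagged = toy.astype(str).apply(lambda col: col.str.contains(pattern, regex=True))
filtered = toy[~flagged.any(axis=1)]
print(filtered)
```

As in the original filter, a row is removed as soon as any one of its cells contains a placeholder, so the match is on substrings rather than on exact cell values.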


In [None]:
new_dataset.info()

In [None]:
new_dataset.head()

In [None]:
new_dataset.tail()

In [None]:
new_dataset.describe()

In [None]:
new_dataset.isnull().sum()

In [None]:
new_dataset.shape

In [None]:
new_dataset.columns

In [None]:
new_dataset.dtypes

In [None]:
new_dataset.nunique()

In [None]:
new_dataset.duplicated().sum()
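
For reference, a small sketch of what the duplicate count in the last cell measures, using an invented toy frame. Whether new_dataset actually contains duplicates is not shown here, and drop_duplicates is only included to illustrate the effect, not as a step this notebook performs.

```python
import pandas as pd

# Invented rows: the third row repeats the second one exactly.
toy = pd.DataFrame({
    "TICKET NUMBER": ["TT001", "TT002", "TT002", "TT003"],
    "TICKET OWNER": ["Agent A", "Agent B", "Agent B", "Agent A"],
})

# duplicated() marks a row True when an identical row appeared earlier in the frame.
n_duplicates = toy.duplicated().sum()
deduplicated = toy.drop_duplicates()
print(n_duplicates, len(deduplicated))
```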