# ChatGPT User Reviews: Data Pre-processing

**Problem Statement**: To enhance the customer experience for (online) products, this project will analyze user reviews on ChatGPT from the iOS store. By leveraging NLP techniques, I aim to classify overall sentiment, extract product-related feedback and identify trends in user satisfaction. Insights from this analysis will inform actionable recommendations to improve a product’s usability, functionality and overall satisfaction.

**Background:** Why do user feedback and user-centric thinking matter? Sentiment analysis identifies the tone and emotion behind reviews, which gives a company a direct view of where customers are satisfied and where they are not. This is especially valuable for early-stage companies because it highlights the areas that most need improvement. By analyzing feedback, a company can pinpoint which product features are performing well or poorly and allocate work accordingly. Startups in particular have limited resources, so they need to move efficiently and prioritize improvements or innovations based on user needs. It also helps a company establish its brand as user-centric and foster customer loyalty: by heeding customer complaints, a company shows that it cares about its users and wants to implement solutions that improve their experience. Manually analyzing thousands of reviews is impractical and inefficient. Sentiment analysis automates the process and allows a company to extract actionable insights from a large dataset.

**Sources:**
> Note: All function docstrings are written with the help of ChatGPT

### Imports

In [1]:
import pandas as pd
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

### Read in Data 

In [2]:
file_path = '../data/chatgpt_reviews.csv'
df = pd.read_csv(file_path)

In [3]:
df.head()

,date,title,review,rating
0,2023-05-21 16:42:24,Much more accessible for blind users than the ...,Up to this point I’ve mostly been using ChatGP...,4
1,2023-07-11 12:24:19,"Much anticipated, wasn’t let down.",I’ve been a user since it’s initial roll out a...,4
2,2023-05-19 10:16:22,"Almost 5 stars, but… no search function",This app would almost be perfect if it wasn’t ...,4
3,2023-05-27 21:57:27,"4.5 stars, here’s why","I recently downloaded the app and overall, it'...",4
4,2023-06-09 07:49:36,"Good, but Siri support would take it to the ne...",I appreciate the devs implementing Siri suppor...,4


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2292 entries, 0 to 2291
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    2292 non-null   object
 1   title   2292 non-null   object
 2   review  2292 non-null   object
 3   rating  2292 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 71.8+ KB
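
The `date` column is read in as `object`, meaning plain strings. If later steps need time-based trends, one option is to convert it with `pd.to_datetime`. The following is a minimal sketch on a small made-up frame (`sample` is a hypothetical stand-in for `df`, with timestamps in the format shown by `df.head()`), not part of the original pipeline:

```python
import pandas as pd

# Toy stand-in for the reviews frame (hypothetical data, same timestamp format)
sample = pd.DataFrame({
    "date": ["2023-05-21 16:42:24", "2023-07-11 12:24:19"],
    "rating": [4, 4],
})

# errors="coerce" turns unparseable strings into NaT instead of raising
sample["date"] = pd.to_datetime(sample["date"], errors="coerce")

print(sample.dtypes)
print(sample["date"].dt.month.tolist())
```

Once the column is a datetime type, the `.dt` accessor makes it straightforward to group reviews by month or week.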


#### Check for missing values 

In [5]:
df.isnull().sum()

date      0
title     0
review    0
rating    0
dtype: int64
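
Note that `isnull()` only flags `NaN`/`None`; a review consisting of an empty or whitespace-only string still counts as non-null. A small sketch of an additional check, assuming it is worth ruling out blank text as well (the `sample` frame below is made up for illustration):

```python
import pandas as pd

# Hypothetical frame with one whitespace-only title and one empty review
sample = pd.DataFrame({
    "title": ["Great app", "   ", "Okay"],
    "review": ["Works well", "Crashes often", ""],
})

# isnull() reports no missing values here, because blank strings are not NaN
null_counts = sample.isnull().sum()

# Count entries that are empty once surrounding whitespace is stripped
blank_counts = sample.apply(lambda col: col.str.strip().eq("").sum())

print(null_counts.to_dict())
print(blank_counts.to_dict())
```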

### Preprocess the user reviews

In [6]:
def preprocess(text):
    """
    Preprocesses the input text by:
    1. Removing punctuation.
    2. Converting all characters to lowercase.
    3. Tokenizing the text into words.
    4. Removing stopwords.

    Parameters:
    -----------
    text : str
        The input text to preprocess.

    Returns:
    --------
    str
        The preprocessed text as a string with tokens joined by spaces.
    """
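    # Assumes the NLTK 'stopwords' corpus and tokenizer data were already fetched with nltk.download()
    # Use regex to remove any special characters or punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Lowercase all the text
    text = text.lower()
    # Tokenize the text
    tokens = word_tokenize(text)
    # Load the English stopwords once into a set for fast membership checks
    stop_words = set(stopwords.words('english'))
    # Remove all English stopwords
    tokens = [word for word in tokens if word not in stop_words]
    return " ".join(tokens)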
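
Because punctuation is stripped before tokenizing, apostrophes and decimal points disappear inside words and numbers. The sketch below isolates just the regex step (the helper name `strip_punctuation` is introduced here for illustration) to show the effect on two titles from the dataset:

```python
import re

def strip_punctuation(text):
    # Same pattern as in preprocess(): drop anything that is not a word character or whitespace
    return re.sub(r'[^\w\s]', '', text)

print(strip_punctuation("4.5 stars, here’s why"))   # 45 stars heres why
print(strip_punctuation("Much anticipated, wasn’t let down."))  # Much anticipated wasnt let down
```

This is why tokens such as `45`, `heres`, and `wasnt` show up in the processed text. Whether that matters depends on the downstream model; it is worth keeping in mind when interpreting token frequencies.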

In [7]:
df['text'] = df['title'] + ' ' + df['review']
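
String concatenation with `+` propagates missing values: if either `title` or `review` were `NaN`, the combined `text` would be `NaN` too. This dataset has no nulls, so the line above is safe here; for other datasets, a defensive variant might look like the following sketch (the `sample` frame is hypothetical):

```python
import pandas as pd

# Hypothetical frame where one title is missing
sample = pd.DataFrame({
    "title": ["Almost 5 stars", None],
    "review": ["No search function", "Works fine"],
})

# Plain concatenation: the row with the missing title becomes NaN
naive = sample["title"] + " " + sample["review"]

# Filling missing values with empty strings first keeps the available text
safe = (sample["title"].fillna("") + " " + sample["review"].fillna("")).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```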

In [8]:
df['processed_text'] = df['text'].apply(preprocess)

In [9]:
# Check to make sure that the text has been cleaned 
df[['text', 'processed_text']].head()

,text,processed_text
0,Much more accessible for blind users than the ...,much accessible blind users web version point ...
1,"Much anticipated, wasn’t let down. I’ve been a...",much anticipated wasnt let ive user since init...
2,"Almost 5 stars, but… no search function This a...",almost 5 stars search function app would almos...
3,"4.5 stars, here’s why I recently downloaded th...",45 stars heres recently downloaded app overall...
4,"Good, but Siri support would take it to the ne...",good siri support would take next level apprec...


In [10]:
df.head()

,date,title,review,rating,text,processed_text
0,2023-05-21 16:42:24,Much more accessible for blind users than the ...,Up to this point I’ve mostly been using ChatGP...,4,Much more accessible for blind users than the ...,much accessible blind users web version point ...
1,2023-07-11 12:24:19,"Much anticipated, wasn’t let down.",I’ve been a user since it’s initial roll out a...,4,"Much anticipated, wasn’t let down. I’ve been a...",much anticipated wasnt let ive user since init...
2,2023-05-19 10:16:22,"Almost 5 stars, but… no search function",This app would almost be perfect if it wasn’t ...,4,"Almost 5 stars, but… no search function This a...",almost 5 stars search function app would almos...
3,2023-05-27 21:57:27,"4.5 stars, here’s why","I recently downloaded the app and overall, it'...",4,"4.5 stars, here’s why I recently downloaded th...",45 stars heres recently downloaded app overall...
4,2023-06-09 07:49:36,"Good, but Siri support would take it to the ne...",I appreciate the devs implementing Siri suppor...,4,"Good, but Siri support would take it to the ne...",good siri support would take next level apprec...


### Save the cleaned reviews as a new .csv file

In [11]:
df.to_csv('../data/chatgpt_cleaned.csv', index=False)
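
One thing to watch when reloading the saved file: if a review consisted only of stopwords and punctuation, its `processed_text` is an empty string, and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below demonstrates this with an in-memory buffer instead of a real file (the `sample` frame is made up), and shows `keep_default_na=False` as one possible way to preserve empty strings:

```python
import io
import pandas as pd

# Hypothetical cleaned frame; the second review was reduced to an empty string
sample = pd.DataFrame({
    "rating": [4, 1],
    "processed_text": ["good siri support", ""],
})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default read: the empty field comes back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps the empty field as an empty string
buffer.seek(0)
kept_empty = pd.read_csv(buffer, keep_default_na=False)

print(default_read["processed_text"].isna().tolist())
print(kept_empty["processed_text"].tolist())
```

Alternatively, calling `fillna("")` on the text column after loading achieves the same result without changing how other columns are parsed.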