## Data Analysis

In this section, we will perform various data analysis tasks on the Yelp reviews dataset. This includes descriptive statistics, correlation analysis, grouping and aggregation, and trend analysis.

Let's first load the dataset.

In [None]:
# Importing necessary libraries
import os  # Module for interacting with the operating system
import pandas as pd  # Library for data manipulation and analysis
import numpy as np  # Library for numerical computations

file_path = '../Datasets/Yelp_Reviews/reviews.csv'  # Define the path to the CSV file

reviews = pd.read_csv(file_path)  # Read the CSV file into a DataFrame

reviews.head()  # Display the first few rows of the DataFrame

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",7/7/2018 22:09
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,1/3/2012 15:28
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2/5/2014 20:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,0,1,"Wow! Yummy, different, delicious. Our favo...",1/4/2015 0:01
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,0,1,Cute interior and owner (?) gave us tour of up...,1/14/2017 20:54


### Summary statistics

#### Distribution of Ratings
We analyze the distribution of ratings in the Yelp reviews dataset. Understanding the distribution of ratings can provide insights into customer satisfaction and help identify patterns or trends in the feedback.

- **rating_distribution = reviews['stars'].value_counts().sort_index()**:
    - **reviews['stars']**: Select the 'stars' column from the DataFrame `reviews`, which contains the ratings given in the reviews.
    - **value_counts()**: Counts the occurrence of each unique rating value, giving us the number of reviews for each rating.
    - **sort_index()**: Sorts the counts by the rating values (index) in ascending order.

- **rating_distribution**:
    - This variable now holds a Series with the count of reviews for each rating, sorted by the rating values. It provides a clear view of how many reviews were given for each rating level (e.g., 1 star, 2 stars, etc.).

In [None]:
# Distribution of Ratings
rating_distribution = reviews['stars'].value_counts().sort_index()  # Count the occurrences of each rating and sort by rating value

rating_distribution  # Display the distribution of ratings

stars
1    222
2    148
3    236
4    474
5    883
Name: count, dtype: int64
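
As a small follow-up sketch, the counts above can be turned into percentage shares and a mean rating. The Series below is rebuilt by hand from the printed output so that the snippet runs on its own; with the DataFrame loaded, `reviews['stars'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the output above (index = star rating)
rating_distribution = pd.Series(
    [222, 148, 236, 474, 883], index=[1, 2, 3, 4, 5], name="count"
)
rating_distribution.index.name = "stars"

# Total number of reviews across all rating levels
total_reviews = rating_distribution.sum()

# Share of each rating level, as a percentage rounded to one decimal
rating_share = (rating_distribution / total_reviews * 100).round(1)

# Mean rating, weighting each star value by its count
mean_rating = (
    rating_distribution.index.to_numpy() * rating_distribution.to_numpy()
).sum() / total_reviews

print(rating_share)
print(f"Total reviews: {total_reviews}, mean rating: {mean_rating:.2f}")
```

Percentages are often easier to compare than raw counts, especially if you later compare subsets of different sizes.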

<div style="background-color: #00008B; padding: 10px; color: white; padding: 10px;">

Now you can try to do something on your own with GenAI!

🤖 **Suggested prompt**
<br>

This is my data:

`<data>`

{paste data}

`</data>`

In my lesson, we ran this code:

`rating_distribution = reviews['stars'].value_counts().sort_index()  # Count the occurrences of each rating and sort by rating value`

Given my dataset, what are some other summary statistics I could look at?

</div>

In [None]:
# paste your AI-generated code here to give it a go:


### Statistical Analysis
Below are some basic statistical analyses you can do on your dataset. 

#### Correlation Analysis: Review Length vs Useful Votes
Here we analyze the statistical correlation between the length of a review and the number of useful votes it receives. By examining this correlation, we can understand if there is a linear relationship between these two variables.

1. **Calculate Review Length**:

In [None]:
# Create a new column 'review_length' which is the length of each review in characters
reviews['review_length'] = reviews['text'].apply(len)


- **reviews['text']**: Selects the 'text' column from the DataFrame reviews, which contains the review texts.
- **apply(len)**: Applies the len function to each review text, calculating the length of each review in terms of the number of characters.
- **reviews['review_length']**: Creates a new column 'review_length' in the DataFrame reviews to store the length of each review.
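
One caveat, shown as a hedged sketch on made-up data: `apply(len)` raises a `TypeError` if a review text is missing, because `len` cannot be applied to a missing value. The pandas string accessor `.str.len()` returns `NaN` for such rows instead. The `sample` DataFrame below is hypothetical and only serves to illustrate the difference.

```python
import pandas as pd

# Hypothetical sample with one missing review text
sample = pd.DataFrame({"text": ["Great food!", None, "Too slow."]})

# .str.len() counts characters and leaves missing texts as NaN
sample["review_length"] = sample["text"].str.len()

# Count how many rows have no usable length
n_missing = sample["review_length"].isna().sum()

print(sample)
print(f"Rows with missing text: {n_missing}")
```

If the dataset has no missing texts, both approaches give the same lengths.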

2. **Select Relevant Columns**:

In [None]:
# Create a new DataFrame with the relevant columns for correlation analysis
review_length_vs_useful = reviews[['review_length', 'useful']]


- **reviews[['review_length', 'useful']]**: Selects the 'review_length' and 'useful' columns from the DataFrame `reviews` and creates a new DataFrame `review_length_vs_useful` containing these two columns.

3. **Calculate and Display the Correlation**:

In [None]:
# Calculate the correlation between review length and useful votes
correlation = review_length_vs_useful.corr()

# Display the correlation matrix
print(correlation)

               review_length    useful
review_length       1.000000  0.280841
useful              0.280841  1.000000


- **review_length_vs_useful.corr()**: Computes the pairwise correlation of columns in the DataFrame `review_length_vs_useful`.
- **print(correlation)**: Displays the correlation matrix, showing the correlation coefficient between 'review_length' and 'useful'.

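The correlation coefficient (Pearson's r, the default for `corr()`) measures how strongly review length and useful votes are linearly related. A coefficient close to 1 indicates a strong positive correlation, close to -1 a strong negative correlation, and close to 0 little or no linear relationship. The value here, about 0.28, points to a weak positive relationship: longer reviews tend to receive somewhat more useful votes.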
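
To show what the coefficient measures, here is a minimal sketch that computes the Pearson coefficient (the default method of pandas `corr()`) by hand and compares it with the pandas result. The `toy` data is invented for illustration and is not taken from the Yelp dataset.

```python
import numpy as np
import pandas as pd

# Invented toy data: review lengths (characters) and useful votes
toy = pd.DataFrame({
    "review_length": [120, 340, 560, 90, 800],
    "useful": [0, 1, 3, 0, 2],
})

x = toy["review_length"].to_numpy(dtype=float)
y = toy["useful"].to_numpy(dtype=float)

# Deviations of each value from its column mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson r: sum of co-deviations, scaled by the size of both deviations
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The same quantity computed by pandas
r_pandas = toy["review_length"].corr(toy["useful"])

print(f"manual: {r_manual:.6f}, pandas: {r_pandas:.6f}")
```

Both numbers should agree up to floating-point rounding.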

Here’s the complete code for the correlation analysis:

In [None]:
# Calculate Review Length
reviews['review_length'] = reviews['text'].apply(len)

# Select Relevant Columns
review_length_vs_useful = reviews[['review_length', 'useful']]

# Calculate and Display the Correlation
correlation = review_length_vs_useful.corr()
print(correlation)

               review_length    useful
review_length       1.000000  0.280841
useful              0.280841  1.000000


#### Group Comparison: Average Star Ratings
Here, we will analyze the star ratings of different businesses using Analysis of Variance (ANOVA). In this lab, we will guide you step by step through the process of renaming businesses, filtering data, and performing ANOVA to compare the ratings.

##### Step 1: Load the Dataset

First, we need to load the dataset containing business reviews. We already did this above, but here we also import an additional library for the statistical analysis.

In [None]:
# Import necessary libraries
import pandas as pd  # Pandas is used for data manipulation and analysis.
from scipy.stats import f_oneway  # Scipy's f_oneway function is used to perform ANOVA.

file_path = '../Datasets/Yelp_Reviews/reviews.csv'  # Define the path to the CSV file

reviews = pd.read_csv(file_path)  # Read the CSV file into a DataFrame

reviews.head()  # Display the first few rows of the DataFrame

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",7/7/2018 22:09
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,1/3/2012 15:28
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2/5/2014 20:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,0,1,"Wow! Yummy, different, delicious. Our favo...",1/4/2015 0:01
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,0,1,Cute interior and owner (?) gave us tour of up...,1/14/2017 20:54



- **Pandas (`pd`)**: A powerful data manipulation library in Python. It provides data structures and functions needed to manipulate structured data.
- **SciPy (`f_oneway`)**: A Python library used for scientific and technical computing. Here, we use the `f_oneway` function from SciPy's stats module to perform ANOVA.

##### Step 2: Identify Top 10 Businesses

Next, we will identify the top 10 businesses based on the number of reviews.

In [None]:
# Count the number of reviews for each business and sort in descending order
business_review_counts = reviews['business_id'].value_counts()
top_businesses = business_review_counts.head(10)

# Display the top 10 businesses
print(top_businesses)

business_id
GBTPC53ZrG1ZBY3DT8Mbcw    17
pSmOH4a3HNNpYM82J5ycLA    12
PY9GRfzr4nTZeINf346QOw    11
W4ZEKkva9HpAdZG88juwyQ    10
Zi-F-YvyVOK0k5QD7lrLOg     9
EtKSTHV5Qx_Q7Aur9o4kQQ     9
Dv6RfXLYe1atjgz3Xf4GGw     8
SZU9c8V2GuREDN5KgyHFJw     7
9gObo5ltOMo6UgsaXaHPWA     7
VRGYwKE_Z77frm5NwLvJhw     7
Name: count, dtype: int64


- **`value_counts()`**: Counts how many times each unique value appears in a column and returns the counts sorted in descending order by default.
- **`head(10)`**: Selects the top 10 entries from the Series.
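
For comparison, here is a hedged sketch of the same "top N by count" idea on a tiny invented dataset, next to an equivalent `groupby` formulation. The business IDs `b1`, `b2`, `b3` are placeholders.

```python
import pandas as pd

# Invented toy data: one row per review
toy_reviews = pd.DataFrame({"business_id": ["b1", "b2", "b1", "b3", "b1", "b2"]})

# value_counts() sorts by count in descending order, so head(2) is the top 2
counts = toy_reviews["business_id"].value_counts()
top_2 = counts.head(2)

# Equivalent formulation: group, count rows per group, keep the 2 largest
top_2_alt = toy_reviews.groupby("business_id").size().nlargest(2)

print(top_2)
print(top_2_alt)
```

Note that when several businesses tie at the cutoff, `head(n)` simply keeps whichever tied entries come first, so the last places in a top-N list can be somewhat arbitrary.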

##### Step 3: Rename Top 10 Businesses

We will rename the top 10 businesses as "Business_A", "Business_B", etc., for easier analysis.

In [None]:
# Rename the top 10 businesses as businesses A-J for alphabetical sorting
business_labels = ['Business_A', 'Business_B', 'Business_C', 'Business_D', 'Business_E', 'Business_F', 'Business_G', 'Business_H', 'Business_I', 'Business_J']
top_10_businesses = top_businesses.index[:10].tolist()
business_mapping = {business_id: business_labels[i] for i, business_id in enumerate(top_10_businesses)}

# Create a new column 'business_label' with the new business names
reviews['business_label'] = reviews['business_id'].map(business_mapping)

# Verify that the business labels have been correctly assigned by displaying the top 10 businesses
top_10_renamed = reviews[reviews['business_id'].isin(top_10_businesses)][['business_id', 'business_label']].drop_duplicates()
top_10_renamed = top_10_renamed.sort_values(by='business_label').reset_index(drop=True)
print(top_10_renamed)

              business_id business_label
0  GBTPC53ZrG1ZBY3DT8Mbcw     Business_A
1  pSmOH4a3HNNpYM82J5ycLA     Business_B
2  PY9GRfzr4nTZeINf346QOw     Business_C
3  W4ZEKkva9HpAdZG88juwyQ     Business_D
4  Zi-F-YvyVOK0k5QD7lrLOg     Business_E
5  EtKSTHV5Qx_Q7Aur9o4kQQ     Business_F
6  Dv6RfXLYe1atjgz3Xf4GGw     Business_G
7  SZU9c8V2GuREDN5KgyHFJw     Business_H
8  9gObo5ltOMo6UgsaXaHPWA     Business_I
9  VRGYwKE_Z77frm5NwLvJhw     Business_J
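
The label list does not have to be typed out by hand. The sketch below, on invented IDs, builds the labels from `string.ascii_uppercase` and shows that `map()` leaves `NaN` for any business not in the mapping, which is what makes a `notnull()` filter on the label column work.

```python
import string
import pandas as pd

# Invented toy data: "b9" stands for a business outside the top list
toy = pd.DataFrame({"business_id": ["b1", "b2", "b1", "b9"]})
top_ids = ["b1", "b2"]

# Build one label per top business: Business_A, Business_B, ...
labels = [f"Business_{letter}" for letter in string.ascii_uppercase[:len(top_ids)]]
mapping = dict(zip(top_ids, labels))

# IDs missing from the mapping become NaN in the new column
toy["business_label"] = toy["business_id"].map(mapping)

# Keep only the rows that received a label
top_only = toy[toy["business_label"].notnull()]

print(toy)
print(top_only)
```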


##### Step 4: Filter Data for Top 10 Businesses

We will filter the dataset to include only reviews for the top 10 businesses.

In [None]:
# Filter the dataset to include only the top 10 businesses
top_10_reviews = reviews[reviews['business_label'].notnull()]

# Display the filtered DataFrame
print(top_10_reviews.head(10))

                  review_id                 user_id             business_id  \
37   pHwbdway4yeI-dSSmZA7-Q  qEEk0PuoH1dVa619t8fgpw  PY9GRfzr4nTZeINf346QOw   
44   jC-fGfx-YLqxVBcyTAd4Pw  EBa-0-6AKoy6jziNexDJtg  W4ZEKkva9HpAdZG88juwyQ   
49   cvQXRFLCyr0S7EgFb4lZqw  ZGjgfSvjQK886kiTzLwfLQ  EtKSTHV5Qx_Q7Aur9o4kQQ   
61   4zopEEPqfwm-c_FNpeHZYw  JYYYKt6TdVA4ng9lLcXt_g  SZU9c8V2GuREDN5KgyHFJw   
71   aAcQibR3zWOvk4atbCM3SA  7P9w2PrP4ZcJyDFwch51Ig  Zi-F-YvyVOK0k5QD7lrLOg   
81   7rCsR3SARVF3vXNiw_Csgg  mmdf_Fi-Hh_3uZN5zE164A  9gObo5ltOMo6UgsaXaHPWA   
108  yyrMqY7sNp5gT7KJ1AaYWA  pitYOVSsF8R1gWG1G0qxsA  GBTPC53ZrG1ZBY3DT8Mbcw   
113  3dVcGYz6GokuEytLrfG8bA  FEI0XkOrUHufSW_rfOTPAA  Dv6RfXLYe1atjgz3Xf4GGw   
119  S4nZgOgiv9w8MOiaWTpwBQ  8fPlzYWo0j_nQrJMeyF0Fw  pSmOH4a3HNNpYM82J5ycLA   
143  pJRn59F_lyNO1zT3TCVd0Q  TGgfqWnUaCf6DM7TLuNhDQ  pSmOH4a3HNNpYM82J5ycLA   

     stars  useful  funny  cool  \
37       4       0      0     0   
44       3       0      0     0   
49       5       3      1

##### Step 5: Perform ANOVA

Finally, we will perform ANOVA to compare the average star ratings of the top 10 businesses.

In [None]:
# Perform ANOVA on the star ratings of the top 10 businesses
groups = [top_10_reviews[top_10_reviews['business_label'] == f'Business_{letter}']['stars'] for letter in 'ABCDEFGHIJ']
f_stat, p_value = f_oneway(*groups)

# Display the ANOVA results
print(f"F-statistic: {f_stat}, p-value: {p_value}")

F-statistic: 0.8336292812943005, p-value: 0.5871479728518842


##### Explanation:

- **Loading the Dataset**: We load the data into a pandas DataFrame for easy manipulation.
- **Identifying Top 10 Businesses**: We count and sort the businesses by the number of reviews to find the top 10.
- **Renaming Businesses**: We map the top 10 businesses to new labels for simplicity.
- **Filtering Data**: We filter the DataFrame to include only the reviews of the top 10 businesses.
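- **Performing ANOVA**: We use ANOVA to determine if there are significant differences in the average star ratings among the top 10 businesses. Here the p-value (about 0.59) is well above the conventional 0.05 threshold, so we find no evidence that the average ratings of these businesses differ.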
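
To make the F-statistic less of a black box, the following sketch computes it by hand for three invented groups of ratings and checks the result against `f_oneway`. The group values are made up for illustration; only the formula (variance between groups divided by variance within groups) is the point.

```python
import numpy as np
from scipy.stats import f_oneway

# Invented star ratings for three hypothetical businesses
groups = [
    np.array([5, 4, 5, 3, 4], dtype=float),
    np.array([3, 4, 2, 3, 3], dtype=float),
    np.array([4, 5, 4, 4, 5], dtype=float),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = len(all_values)    # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of values around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, p_value = f_oneway(*groups)
print(f"manual F: {f_manual:.4f}, scipy F: {f_scipy:.4f}, p-value: {p_value:.4f}")
```

A large F means the group means differ by more than the spread within groups would suggest; a small F, as in the lesson's result, means they do not.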


<div style="background-color: #00008B; padding: 10px; color: white; padding: 10px;">

Try to generate another kind of analysis here. 

🤖 **Suggested prompt**
<br>

This is my data:

`<data>`

{paste data}

`</data>`

In my lesson, we just performed a correlation analysis and an ANOVA. What are some other statistical analyses I could perform on my dataset? Please provide the pros and cons of each method and explain briefly how to implement them in Python.

</div>

<div style="background-color: #00008B; padding: 10px; color: white; padding: 10px;">
🤖 
<br>
**Although the explanation of this code block is out of scope for this course, our Generative AI friend can certainly help!**

</div>

### Advanced: Sentiment Analysis

Thanks to GenAI tools, it's now possible for people without much experience to do analyses as advanced and specialized as Sentiment Analysis.

Sentiment analysis is the process of using natural language processing (NLP) and machine learning techniques to determine the emotional tone or sentiment expressed in a piece of text. It is widely used in areas such as social media monitoring, customer feedback analysis, and market research to understand the opinions and feelings of individuals.

To get code for this, you could ask an AI assistant something like:

<div style="background-color: #00008B; padding: 10px; color: white; padding: 10px;">

🤖  **Example prompt:**

<br>
How can I perform sentiment analysis on my dataset of customer reviews in Python? Provide a Python code example that classifies the sentiment of each review as positive, negative, or neutral and calculates the distribution of these sentiments.

Here's a sample of my dataset

`<data>`

`{paste data}`

`</data>`

</div>

In [None]:
# Install the necessary libraries
# %pip install pandas textblob

import pandas as pd  # Import the pandas library for data manipulation and analysis
from textblob import TextBlob  # Import the TextBlob library for sentiment analysis

# Function to classify sentiment
def classify_sentiment(text):
    """
    This function takes a text string as input and uses TextBlob to analyze its sentiment.
    It returns 'Positive' if the polarity is greater than 0,
    'Negative' if the polarity is less than 0, and 'Neutral' if the polarity is 0.
    """
    analysis = TextBlob(text)  # Create a TextBlob object to analyze the sentiment of the text
    if analysis.sentiment.polarity > 0:  # Check if the polarity is greater than 0
        return 'Positive'  # Return 'Positive' for positive sentiment
    elif analysis.sentiment.polarity < 0:  # Check if the polarity is less than 0
        return 'Negative'  # Return 'Negative' for negative sentiment
    else:
        return 'Neutral'  # Return 'Neutral' for neutral sentiment

# Apply sentiment analysis to the 'text' column of the reviews DataFrame
reviews['sentiment'] = reviews['text'].apply(classify_sentiment)
# This line applies the classify_sentiment function to each element in the 'text' column
# and stores the result in a new column called 'sentiment'.

# Calculate the distribution of sentiment values
sentiment_distribution = reviews['sentiment'].value_counts()
# This line counts the occurrences of each sentiment category
# (Positive, Negative, Neutral) and stores the result in sentiment_distribution.

# Display the sentiment distribution
sentiment_distribution
# In a notebook, ending a cell with a bare expression displays its value.


sentiment
Positive    1731
Negative     224
Neutral        8
Name: count, dtype: int64
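
The cell above treats only a polarity of exactly 0 as neutral, which may be why so few reviews land in that category. One possible variation, sketched here as an assumption rather than as part of the lesson, is to treat a small band around 0 as neutral. The function name `classify_polarity` and the `neutral_band=0.05` threshold are hypothetical choices, not TextBlob defaults, and the polarity values are invented so that the snippet runs without TextBlob.

```python
def classify_polarity(polarity, neutral_band=0.05):
    """Map a polarity score in [-1, 1] to a label, with a neutral band around 0."""
    if polarity > neutral_band:
        return "Positive"
    if polarity < -neutral_band:
        return "Negative"
    return "Neutral"


# Invented polarity scores standing in for TextBlob output
example_polarities = [0.62, 0.03, -0.01, -0.40, 0.0]
labels = [classify_polarity(p) for p in example_polarities]

print(labels)
```

With real data, you would pass `TextBlob(text).sentiment.polarity` to this function; the right band width depends on the dataset and is worth checking against a handful of reviews you have read yourself.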

<div style="background-color: #ADD8E6; padding: 10px;">

🤖 
<br>
**Many more types of analysis can be done on this dataset; it all depends on which questions you want answered. For example, you might run sentiment analysis over time, analyzing how the sentiment of reviews has changed to identify trends or shifts in customer satisfaction.**

</div>
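
A minimal sketch of the sentiment-over-time idea, assuming a DataFrame with a `date` column in the same `month/day/year hour:minute` format as the dataset and a `sentiment` column like the one created above. The rows below are invented so that the snippet runs on its own.

```python
import pandas as pd

# Invented rows in the same shape as the lesson's data
toy = pd.DataFrame({
    "date": [
        "1/3/2012 15:28", "6/9/2012 10:00", "2/5/2014 20:30",
        "1/4/2015 0:01", "3/8/2015 12:15", "7/7/2018 22:09",
    ],
    "sentiment": ["Positive", "Negative", "Positive", "Positive", "Positive", "Negative"],
})

# Parse the date strings into real timestamps
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%Y %H:%M")

# Share of positive reviews per calendar year
is_positive = toy["sentiment"] == "Positive"
positive_share_by_year = is_positive.groupby(toy["date"].dt.year).mean()

print(positive_share_by_year)
```

On the full dataset, years with only a few reviews will give noisy shares, so it may help to look at the review count per year alongside the share.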