<a href="https://colab.research.google.com/github/lnomkin/Text_Analysis_Final_Project/blob/main/Copy_of_311_Service_Request_Comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#311 Service Request Comparison: NYC Open Data and NYC/311 Subreddit 2020-2023, from the De Blasio to the Adams administrations

##1. Introduction: Understanding the public’s response to social service programming

  The pandemic exposed New York City’s deep inequalities, impacting job loss, education access, and homelessness. Today, economic challenges and the end of Covid-19 aid continue to affect low-income communities. The city’s recovery efforts varied under de Blasio and Adams.
  
  De Blasio, elected on a progressive platform, oversaw a 15% wage increase for the lowest earners and a 13% decrease in poverty, bringing poverty to its lowest level since 2005 [Family, 2022](https://www.icph.org/reports/family-homelessness-in-new-york-city-what-the-adams-administration-can-learn-from-previous-mayoralties/#keeping-new-yorkers-housed-homelessness-prevention-vouchers-and-housing). His key achievements included universal Pre-K and 3-K, easing childcare burdens, and streamlining social services. These gains reversed in the pandemic’s first year.
  
  Adams, elected on a moderate platform, focused on crime, economic recovery, and supporting small businesses [Get Stuff Done, 2024](https://www.nyc.gov/content/getstuffdone/pages/initiatives). Under his administration, cash-assistance recipients rose by 23%, while staffing shortages and an asylum crisis strained services [Family, 2022](https://www.icph.org/reports/family-homelessness-in-new-york-city-what-the-adams-administration-can-learn-from-previous-mayoralties/#keeping-new-yorkers-housed-homelessness-prevention-vouchers-and-housing). Poverty and unemployment remain high, and Adams faces the lowest approval ratings of any Mayor.
  
  Analyzing 311 service requests for social services during the pandemic offers insights into residents’ experiences. While Open Data NYC provides request types, the full text of each request is unavailable. Complementing this data with sentiment analysis of Reddit discussions can reveal how the young adults and older residents who make up the majority of Reddit users perceived government services during this time [Anderson, 2024](https://www.socialchamp.io/blog/reddit-demographics/).

  This research is critical for understanding how New Yorkers engage with social services. As the city faces ongoing challenges like the housing crisis, child care shortages, and rising welfare requests, this study can inform policymakers on public sentiment and service effectiveness by answering the following questions.




#1.2 Research Questions:
1.  Is there a change in sentiment in NYC/311 Reddit posts across the de Blasio and Adams administrations, during a period when the focus shifted from self-sufficiency to expanded benefits access?
2.  Comparing key social service topics, what are the general trends in 311 service requests and NYC/311 Reddit posts related to social services across the two administrations? What is the sentiment of these themes?
3.  What are the most frequently used keywords by the public on the r/NYC 311 subreddit and 311 service complaints?


#1.3 Hypotheses:
1.  There will likely be more complaints related to welfare and human services during the Adams administration due to expanded access leading to staffing shortages and delays in processing times.
2.  The expanded pandemic-related aid created a dependency on social services, with complaints likely focusing on barriers to access, benefit expirations, and challenges in meeting basic needs.
3.  Sentiment around social services during the Adams administration is expected to be more negative due to budget cuts, the city's response to the immigration crisis, and efforts to roll back the “Right to Shelter” policy.
4.  Frequent keywords under de Blasio will include unemployment, benefits enrollment, and education, while under Adams, they are expected to focus on housing, migrants, and enrollment delays.


#1.4 Significance and Importance to Public Policy:
This project offers insights into how New Yorkers respond to social services across a change in mayoral administration. Policymakers could use these findings to assess which areas may need further resources or adjustments. Sentiment trends could also show how the public perceives these policies.

In [None]:
!pip install asyncpraw

In [None]:
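import os
import pandas as pd
import asyncpraw
import asyncio
import time

# Initialize Async PRAW
# Credentials are read from environment variables instead of being hardcoded in the notebook
# (the variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are placeholders)
reddit = asyncpraw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="reddit_text_extractor (by u/Good-Bread-994)"
)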

# Subreddits to fetch data from
subreddits = ["nyc", "AskNYC", "newyorkcity", "NYChousing", "311"]

# Function to fetch posts and their top parent comment
async def fetch_posts(subreddits):
    posts = []

    # Loop through each subreddit
    for subreddit_name in subreddits:
        subreddit = await reddit.subreddit(subreddit_name)  # Await the subreddit coroutine
        async for submission in subreddit.top(limit=500):  # Adjust the limit if needed
            # Ensure submission is fully loaded
            await submission.load()

            # Fetch the top parent comment
            await submission.comments.replace_more(limit=0)  # Remove "load more comments" placeholders (a coroutine in Async PRAW, so it must be awaited)
            top_comment = submission.comments[0].body if submission.comments else "No comments"

            posts.append({
                'subreddit': subreddit_name,
                'title': submission.title,
                'selftext': submission.selftext,
                'created_utc': submission.created_utc,
                'top_parent_comment': top_comment,  # Add the top parent comment
            })

    return posts

# Function to run the async task and save to a CSV
async def main():
    posts = await fetch_posts(subreddits)

    # Convert to DataFrame
    df = pd.DataFrame(posts)

    # Save to CSV
    df.to_csv('reddit_threads_with_parent_comments.csv', index=False)
    print("Data saved to 'reddit_threads_with_parent_comments.csv'")

    # Close the reddit session to avoid unclosed session warning
    await reddit.close()

# In Google Colab, use await directly:
await main()
# The script above is adapted from a Medium tutorial on building a Reddit Scraper with Python and Colab: https://python.plainenglish.io/two-step-wsb-scraper-with-colab-b240af5a6105 and from Melanie Walsh's Reddit Data Collection and Analysis: https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/14-Reddit-Data.html

Download file with Reddit data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
from datetime import datetime

# Load the existing CSV file into a DataFrame
df = pd.read_csv('/content/drive/MyDrive/Python F24/Final Project/reddit_threads_with_parent_comments - Use for Project.csv')

# Convert 'created_utc' to a datetime object (human-readable format)
df['created_time'] = pd.to_datetime(df['created_utc'], unit='s')

# Filter posts between 2020 and 2023
# .copy() makes reddit_df an independent DataFrame so later column assignments do not trigger SettingWithCopyWarning
reddit_df = df[(df['created_time'].dt.year >= 2020) & (df['created_time'].dt.year <= 2023)].copy()
# I adapted the code in this cell and the cells below from Rebecca Krisel's Intro to Pandas Workshop: https://github.com/rskrisel/pandas/blob/main/pandas_workshop_2024.ipynb

In [None]:
reddit_df.sample(100)


In [None]:
reddit_df.info()

In [None]:
reddit_df['created_utc'] = pd.to_datetime(reddit_df['created_utc'],  unit='s')
reddit_df['created_utc_str'] = reddit_df['created_utc'].dt.strftime('%Y-%m-%d')

In [None]:
reddit_df.dtypes

In [None]:
reddit_df[reddit_df.duplicated(keep=False)]

In [None]:
reddit_df.columns

In [None]:
reddit_df=reddit_df.rename(columns={'created_utc_str':'date', 'selftext':'textpost'})
reddit_df

In [None]:
reddit_drop_date_df = reddit_df.drop(columns=['created_utc', 'created_time'])
reddit_drop_date_df

In [None]:
reddit_drop_date_df.sort_values(by='date', ascending=False)

In [None]:
print(reddit_drop_date_df['date'].min(), reddit_drop_date_df['date'].max())


In [None]:
reddit_drop_date_df.groupby('subreddit').agg({'title': 'count', 'textpost': 'count', 'top_parent_comment': 'count', 'date': 'first'}).sort_values(by='title', ascending=False)

In [None]:
import matplotlib.pyplot as plt

# Convert 'date' to datetime format if it's not already
reddit_drop_date_df['date'] = pd.to_datetime(reddit_drop_date_df['date'])

# Group by week and month, counting the number of posts
weekly_trend = reddit_drop_date_df.groupby(reddit_drop_date_df['date'].dt.to_period('W')).size()
monthly_trend = reddit_drop_date_df.groupby(reddit_drop_date_df['date'].dt.to_period('M')).size()

# Plot the monthly trend (weekly_trend is computed above but not plotted here)
monthly_trend.plot(kind='line')

# Adding title and labels
plt.title('Monthly Trend of Reddit Posts 2020-2023')  # Title of the plot
plt.xlabel('Month')  # Label for the x-axis
plt.ylabel('Number of Posts')  # Label for the y-axis

# Display the plot
plt.tight_layout()  # Adjust the layout for better readability
plt.show()
#I adapted the scripts to produce the monthly and annual trends from Rebecca Krisel's Intro to Pandas Workshop: https://github.com/rskrisel/pandas/blob/main/pandas_workshop_2024.ipynb

In [None]:
# 'Y' is the yearly period alias; the older 'A' alias is deprecated in recent pandas versions
annual_trend = reddit_drop_date_df.groupby(reddit_drop_date_df['date'].dt.to_period('Y')).size()

annual_trend.plot(kind='line')

plt.title('Annual Trend of Reddit Posts')  # Title of the plot
plt.xlabel('Year')  # Label for the x-axis
plt.ylabel('Number of Posts')  # Label for the y-axis

plt.tight_layout()  # Adjust the layout for better readability
plt.show()


In [None]:
# Search for rows where any column contains '311'
result = reddit_drop_date_df.apply(lambda row: row.astype(str).str.contains('311', case=False).any(), axis=1)
filtered_df = reddit_drop_date_df[result]

total_count = len(filtered_df)

# Print the total count
print("Total count of rows containing '311':", total_count)

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab') # Download the punkt_tab model


# Define stopwords
stop_words = set(stopwords.words('english'))

def tokenize_and_remove_stopwords(text):
    if pd.isna(text):  # Check if the value is NaN
        return ''  # Return an empty string for NaN values
    tokens = word_tokenize(str(text))  # Tokenize the text
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words and word.isalnum()]  # Remove stopwords and punctuation
    return ' '.join(filtered_tokens)  # Join tokens back into a string

# List of columns to clean
columns_to_clean = ['top_parent_comment', 'title', 'textpost']

# Process each column
for column in columns_to_clean:
    if column in reddit_drop_date_df.columns:  # Ensure column exists in the DataFrame
        reddit_drop_date_df[f'cleaned_{column}'] = reddit_drop_date_df[column].apply(tokenize_and_remove_stopwords)

# Display the updated DataFrame
print(reddit_drop_date_df.head())
# I adapted this script from Rebecca Krisel's Intro to NLTK Workshop: https://github.com/rskrisel/intro_to_nltk/blob/main/Intro_NLTK_workshop.ipynb and prompted ChatGPT to update the code to handle NaN values.

In [None]:
reddit_drop_date_df

In [None]:
reddit_short_df = reddit_drop_date_df.drop(columns=['title', 'textpost','top_parent_comment',])
reddit_short_df

In [None]:
# Create a new column 'combined_text' that concatenates 'cleaned_title' and 'cleaned_textpost'
reddit_short_df['combined_text'] = reddit_short_df['cleaned_title'] + ' ' + reddit_short_df['cleaned_textpost']

# Display the updated DataFrame with the new 'combined_text' column
print(reddit_short_df[['cleaned_title', 'cleaned_textpost', 'combined_text']].head())


In [None]:
reddit_short_df

In [None]:
# Drop the 'cleaned_textpost' and 'cleaned_title' columns, which are now merged into 'combined_text'
reddit_short_df = reddit_short_df.drop(columns=['cleaned_textpost', 'cleaned_title'])

# Display the updated DataFrame
print(reddit_short_df.head())


In [None]:
reddit_short_df

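Next, the posts need to be separated into two time periods: the de Blasio administration (2020-2021) and the Adams administration (2022-2023).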
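
One way to make the split explicit is to label each post with the administration in office on its date. The sketch below is a minimal illustration using only the standard library; the function name `administration_for` and the sample dates are hypothetical, and it assumes a handover date of January 1, 2022, which matches the date ranges used in the next cell.

```python
from datetime import date

# Study window and handover date (Adams took office on January 1, 2022)
STUDY_START = date(2020, 1, 1)
HANDOVER = date(2022, 1, 1)
STUDY_END = date(2023, 12, 31)

def administration_for(post_date):
    """Return the administration label for a post date, or None if it falls outside the study window."""
    if post_date < STUDY_START or post_date > STUDY_END:
        return None
    return "de Blasio" if post_date < HANDOVER else "Adams"

# Hypothetical post dates, including the two boundary days
sample_dates = [date(2019, 12, 31), date(2021, 12, 31), date(2022, 1, 1), date(2023, 6, 15)]
labels = [administration_for(d) for d in sample_dates]
print(labels)
```

A single label column like this could be grouped on directly, instead of keeping two separately filtered DataFrames.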

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import string
from datetime import datetime


# Date boundaries for each administration, as pandas Timestamps
start_date_deblasio = pd.Timestamp('2020-01-01')
end_date_deblasio = pd.Timestamp('2021-12-31')

start_date_adams = pd.Timestamp('2022-01-01')
end_date_adams = pd.Timestamp('2023-12-31')

# Filter the DataFrame
deblasio_df = reddit_short_df[(reddit_short_df['date'] >= start_date_deblasio) & (reddit_short_df['date'] <= end_date_deblasio)]
adams_df = reddit_short_df[(reddit_short_df['date'] >= start_date_adams) & (reddit_short_df['date'] <= end_date_adams)]


# Combine text columns into a single string
text_columns = ['cleaned_top_parent_comment', 'combined_text']
combined_text_deblasio = deblasio_df[text_columns].fillna('').agg(' '.join, axis=1).str.cat(sep=' ')
combined_text_adams = adams_df[text_columns].fillna('').agg(' '.join, axis=1).str.cat(sep=' ')

tokens_deblasio = word_tokenize(combined_text_deblasio)
tokens_adams = word_tokenize(combined_text_adams)

# Extend the stopword list
custom_stop_words = {'nan', 'https', 'like', 'nyc', 'get', 'city', 'know', 'got', 'going', 'york', 'really', 'also', 'new'}
stop_words = set(stopwords.words('english')).union(custom_stop_words)

# Clean the tokens
tokens_deblasio = [word.lower() for word in tokens_deblasio if word.isalnum() and word.lower() not in stop_words]
tokens_adams = [word.lower() for word in tokens_adams if word.isalnum() and word.lower() not in stop_words]

# Generate frequency distribution
word_counts_deblasio = Counter(tokens_deblasio)
word_counts_adams = Counter(tokens_adams)

# Get the top 20 keywords
top_keywords_deblasio = word_counts_deblasio.most_common(20)
top_keywords_adams = word_counts_adams.most_common(20)

# Display the results
print("Top 20 Keywords de Blasio:")
for word1, count1 in top_keywords_deblasio:
    print(f"{word1}: {count1}")

print("Top 20 Keywords Adams:")
for word, count in top_keywords_adams:
    print(f"{word}: {count}")
#I adapted this script from Rebecca Krisel's Intro to NLTK Workshop: https://github.com/rskrisel/intro_to_nltk/blob/main/Intro_NLTK_workshop.ipynb and prompted ChatGPT to update the code with two timeframes.

In [None]:
deblasio_df_key = pd.DataFrame(top_keywords_deblasio, columns=['Keyword', 'De Blasio']).set_index('Keyword')
adams_df_key = pd.DataFrame(top_keywords_adams, columns=['Keyword', 'Adams']).set_index('Keyword')

# Combine the two top-20 tables; the outer join keeps keywords that appear in only one list
# (a count of 0 here means "not in that period's top 20", not necessarily "never used")
comparison_df = deblasio_df_key.join(adams_df_key, how='outer').fillna(0)

# Calculate the percentage difference
comparison_df['Percentage Difference'] = (
    (comparison_df['Adams'] - comparison_df['De Blasio']) / comparison_df['De Blasio']
) * 100

# Replace infinite or NaN values (from division by zero) with 0
comparison_df['Percentage Difference'] = comparison_df['Percentage Difference'].replace([float('inf'), -float('inf')], float('nan')).fillna(0)

# Sort by the percentage difference
comparison_df = comparison_df.sort_values(by='Percentage Difference', ascending=False)

# Display the comparison table
print(comparison_df)
#I shared the frequency distribution with ChatGPT and asked it to calculate the percentage difference between the two time periods with a table.

In [None]:
deblasio_df

In [None]:
adams_df

In [None]:
combined_text_adams

In [None]:
combined_text_deblasio

In [None]:
key_words = ['housing', 'benefits', 'shelter', 'wait', 'delays', 'budget', 'immigration', 'migrant', 'covid', 'blasio', 'adams', 'wage', 'employment', 'rent']
print(key_words)

In [None]:
keyword_frequencies_deblasio = {}
keyword_frequencies_adams = {}
for keyword in key_words:

    keyword_frequencies_deblasio[keyword] = tokens_deblasio.count(keyword.lower())
    keyword_frequencies_adams[keyword] = tokens_adams.count(keyword.lower())
print("Keyword Frequencies de Blasio:")
for keyword, frequency in keyword_frequencies_deblasio.items():
    print(f"{keyword}: {frequency}")
print("Keyword Frequencies Adams:")
for keyword, frequency in keyword_frequencies_adams.items():
    print(f"{keyword}: {frequency}")
#I adapted this script from Rebecca Krisel's Intro to NLTK Workshop: https://github.com/rskrisel/intro_to_nltk/blob/main/Intro_NLTK_workshop.ipynb

In [None]:
# Organize the keyword frequencies from smallest to largest
sorted_frequencies_adams = sorted(keyword_frequencies_adams.items(), key=lambda x: x[1])
sorted_frequencies_deblasio = sorted(keyword_frequencies_deblasio.items(), key=lambda x: x[1])

print("Sorted Keyword Frequencies Adams (Smallest to Largest):")
for keyword, frequency in sorted_frequencies_adams:
    print(f"{keyword}: {frequency}")
print("Sorted Keyword Frequencies de Blasio (Smallest to Largest):")
for keyword, frequency in sorted_frequencies_deblasio:
    print(f"{keyword}: {frequency}")
#I prompted ChatGPT to organize the keyword frequencies from smallest to largest.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Define keywords and initialize frequency dictionaries
key_words = ['housing', 'benefits', 'shelter', 'wait', 'delays', 'budget', 'immigration', 'migrant', 'covid', 'blasio', 'adams', 'wage', 'employment', 'rent']

# Initialize the keyword frequency dictionaries with 0 counts for each keyword
keyword_frequencies_deblasio = {keyword: 0 for keyword in key_words}
keyword_frequencies_adams = {keyword: 0 for keyword in key_words}

for keyword in key_words:

    keyword_frequencies_deblasio[keyword] = tokens_deblasio.count(keyword.lower())
    keyword_frequencies_adams[keyword] = tokens_adams.count(keyword.lower())


keywords = key_words
de_blasio_frequencies = [keyword_frequencies_deblasio[keyword] for keyword in key_words]
adams_frequencies = [keyword_frequencies_adams[keyword] for keyword in key_words]

# Sort the frequencies
sorted_frequencies_adams = dict(sorted(keyword_frequencies_adams.items(), key=lambda item: item[1], reverse=True))
sorted_frequencies_deblasio = dict(sorted(keyword_frequencies_deblasio.items(), key=lambda item: item[1], reverse=True))

# Plot the first chart for De Blasio
plt.figure(figsize=(10, 6))
plt.bar(sorted_frequencies_deblasio.keys(), sorted_frequencies_deblasio.values(), color='skyblue')
plt.xlabel('Keywords', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Frequency of Keywords in Cleaned Text - De Blasio', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Plot the second chart for Adams
plt.figure(figsize=(10, 6))
plt.bar(sorted_frequencies_adams.keys(), sorted_frequencies_adams.values(), color='lightcoral')
plt.xlabel('Keywords', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Frequency of Keywords in Cleaned Text - Adams', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
#I adapted this script from Rebecca Krisel's Intro to NLTK Workshop: https://github.com/rskrisel/intro_to_nltk/blob/main/Intro_NLTK_workshop.ipynb and asked ChatGPT to incorporate my keyword list

Sentiment Analysis
I used VADER, a sentiment analyzer designed for short social media texts, to capture subtle positive and negative tones in the posts and comments that mention my keywords. I used the NLTK library for keyword extraction and co-occurrence analysis to reveal how complaint trends differ between the de Blasio and Adams administrations. I segmented the posts by administration to test whether negativity increased around topics such as housing shortages, Covid, and delays in benefits access.
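
Co-occurrence analysis is mentioned above, so the following is a minimal sketch of what counting keyword pairs per post could look like. It is an illustration under stated assumptions rather than the method used in this notebook: the sample posts and the helper name `keyword_cooccurrence` are hypothetical, and the posts are assumed to be already cleaned and lowercased.

```python
from collections import Counter
from itertools import combinations

# Hypothetical cleaned posts (lowercased, stopwords removed)
posts = [
    "housing rent increase shelter",
    "shelter migrant housing",
    "benefits wait delays",
    "rent housing benefits",
]
key_words = {"housing", "rent", "shelter", "benefits", "wait"}

def keyword_cooccurrence(posts, key_words):
    """Count how many posts mention each unordered pair of keywords."""
    pair_counts = Counter()
    for post in posts:
        # Keywords present in this post, sorted so each pair has one canonical order
        present = sorted(set(post.split()) & key_words)
        pair_counts.update(combinations(present, 2))
    return pair_counts

pair_counts = keyword_cooccurrence(posts, key_words)
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Running the same function separately on the de Blasio and Adams posts would show which topics tend to be raised together in each period.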

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('all')

analyzer = SentimentIntensityAnalyzer()
#I adapted this script from Melanie Walsh's Sentiment Analysis: https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/04-Sentiment-Analysis.html

In [None]:
def get_sentiment(text):
    # Get the sentiment scores for the text
    scores = analyzer.polarity_scores(text)

    # Return the compound sentiment score (this score reflects the overall sentiment)
    return scores['compound']

# Ensure you're working with a copy of the DataFrame to avoid SettingWithCopyWarning
adams_df = adams_df.copy()
deblasio_df = deblasio_df.copy()

# Apply the get_sentiment function using .loc to avoid SettingWithCopyWarning
adams_df.loc[:, 'combined_text_sentiment'] = adams_df['combined_text'].apply(get_sentiment)
adams_df.loc[:, 'parent_comment_sentiment'] = adams_df['cleaned_top_parent_comment'].apply(get_sentiment)
deblasio_df.loc[:, 'combined_text_sentiment'] = deblasio_df['combined_text'].apply(get_sentiment)
deblasio_df.loc[:, 'parent_comment_sentiment'] = deblasio_df['cleaned_top_parent_comment'].apply(get_sentiment)

# Adjust display settings for horizontal output
pd.set_option('display.max_colwidth', 100)  # Truncate long columns for better readability
pd.set_option('display.expand_frame_repr', False)  # Prevent line wrapping for wide DataFrames

# Adams period sentiment analysis: select the display columns and reset to a clean 0-based index
adams_display = adams_df[['combined_text', 'combined_text_sentiment', 'cleaned_top_parent_comment', 'parent_comment_sentiment']]
adams_display = adams_display.reset_index(drop=True)

# De Blasio period sentiment analysis: select the display columns and reset to a clean 0-based index
deblasio_display = deblasio_df[['combined_text', 'combined_text_sentiment', 'cleaned_top_parent_comment', 'parent_comment_sentiment']]
deblasio_display = deblasio_display.reset_index(drop=True)

# Print Results
print("Adams Period Sentiment Analysis:")
print(adams_display)

print("\nDe Blasio Period Sentiment Analysis:")
print(deblasio_display)
#Using Melanie Walsh's Sentiment Analysis guide, I prompted ChatGPT to write code that applies the sentiment analysis to my two dataframes and specific columns

To analyze the average sentiment, I selected the keywords from my initial hypothesis with a frequency greater than 10. Keywords with a frequency of 10 or less were dropped, as those terms were less relevant than originally expected. The words I removed from my keyword search were: immigration, migrant, delays, blasio, and employment.
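
The frequency cutoff described above can be sketched as a simple filter. The counts below are hypothetical placeholders, not the notebook's results, and the names `keyword_counts` and `MIN_FREQUENCY` are introduced only for illustration.

```python
# Hypothetical keyword counts; the real values come from the frequency cells above
keyword_counts = {"housing": 42, "rent": 31, "migrant": 4, "employment": 9, "budget": 11, "wage": 10}
MIN_FREQUENCY = 10

# Keep only keywords that appear more than MIN_FREQUENCY times
kept = {keyword: count for keyword, count in keyword_counts.items() if count > MIN_FREQUENCY}
dropped = sorted(set(keyword_counts) - set(kept))

print("Kept:", sorted(kept))
print("Dropped:", dropped)
```

Note that a keyword with a count of exactly 10 is dropped, since the rule is "greater than 10".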

In [None]:
# Initialize the keywords list
keywords = ['housing', 'benefits', 'shelter', 'wait', 'delays', 'budget', 'immigration', 'migrant', 'covid', 'blasio', 'adams', 'wage', 'employment', 'rent']

# Function to calculate average sentiment for each keyword
def calculate_average_sentiment(df, keywords, sentiment_column):
    average_sentiments = {}

    for keyword in keywords:
        # Filter rows where the keyword is present in 'combined_text'
        # using str.contains for case-insensitive search
        keyword_rows = df[df['combined_text'].str.contains(keyword, case=False, na=False)]

        # Calculate the average sentiment score for these rows
        avg_sentiment = keyword_rows[sentiment_column].mean()
        average_sentiments[keyword] = avg_sentiment

    return average_sentiments

# Calculate average sentiment for Adams period
# ('combined_text_sentiment' is the post sentiment column created in the previous cell)
adams_average_sentiments = calculate_average_sentiment(adams_df, keywords, 'combined_text_sentiment')

# Calculate average sentiment for De Blasio period
deblasio_average_sentiments = calculate_average_sentiment(deblasio_df, keywords, 'combined_text_sentiment')

# Print the average sentiment for each keyword for both periods
print("Average Post Sentiment for Adams Period:")
for keyword, avg_sentiment in adams_average_sentiments.items():
    print(f"{keyword}: {avg_sentiment}")

print("\nAverage Post Sentiment for De Blasio Period:")
for keyword, avg_sentiment in deblasio_average_sentiments.items():
    print(f"{keyword}: {avg_sentiment}")

#Using Melanie Walsh's Sentiment Analysis guide, I prompted ChatGPT to write code that applies the sentiment analysis to my two dataframes and the post sentiment column


In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt  # Ensure you have plt imported for the plot

# Define the keywords list
keywords = ['housing', 'benefits', 'shelter', 'wait', 'delays', 'budget', 'immigration', 'migrant', 'covid', 'blasio', 'adams', 'wage', 'employment', 'rent']

# Create a DataFrame
# Convert dictionaries to lists for 'Adams Sentiment' and 'De Blasio Sentiment' columns
data = {
    'Keyword': keywords,
    'Adams Sentiment': [adams_average_sentiments.get(keyword, float('nan')) for keyword in keywords],
    'De Blasio Sentiment': [deblasio_average_sentiments.get(keyword, float('nan')) for keyword in keywords]
}

df = pd.DataFrame(data)

# Plotting heatmap
df.set_index('Keyword', inplace=True)
plt.figure(figsize=(10, 6))
sns.heatmap(df.T, annot=True, cmap="coolwarm", center=0, cbar_kws={'label': 'Sentiment'})
plt.title('Post Sentiment Heatmap: Adams vs De Blasio')
plt.show()

#I adapted the above script from Geeks for Geeks Seaborn Heatmap Guide: https://www.geeksforgeeks.org/seaborn-heatmap-a-comprehensive-guide/ and I prompted ChatGPT to modify the code with my keyword list

To analyze the sentiment of the keywords with a frequency greater than 10, I took the average sentiment from the post sentiment column. Budget and wage were among the keywords with the highest sentiment scores. This likely points to the limitations of sentiment analysis: both keywords were viewed positively in the press or in analyses of the mayors' administrations. The budget, particularly during the Adams administration, was highly criticized, as he cut the budget of all agencies by 5% and threatened to significantly reduce library and senior services. Benefits had the highest sentiment score at 0.99, which could reflect the increase in benefits during the Covid pandemic or simply show that Reddit users speak positively about benefits offered in NYC. Wage also had a high sentiment score under de Blasio, which is surprising given the percentage of people unemployed.
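
One limitation worth illustrating is how the compound score saturates. VADER's reference implementation normalizes the summed word valences `x` as `x / sqrt(x^2 + alpha)` with `alpha = 15`, so a long post with many mildly positive words can score close to 1 even if its overall message is critical. The sketch below reproduces only that normalization step; the valence sums are hypothetical and it does not run VADER itself.

```python
import math

def normalize(valence_sum, alpha=15):
    """Map an unbounded summed valence into the range (-1, 1), following VADER's compound-score normalization."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

# Hypothetical summed valences: a short post, a medium post, and a long post
for valence_sum in (1.0, 5.0, 30.0):
    print(valence_sum, round(normalize(valence_sum), 3))
```

Because of this saturation, averages near 0.99 may say more about post length than about how positively a topic is discussed.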

The keywords that received the lowest sentiment scores were Adams at 0.004 and shelter at 0.13. The low score for shelter from 2020 to 2023 aligns with my hypothesis that shelter would be perceived negatively during this timeframe, when shelters were at capacity due to the migrant crisis.
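
A caveat on how posts are matched to keywords: the averages are computed with `str.contains`, which matches substrings, so a keyword such as `rent` may also match words like `parent` or `current`. The sketch below shows a whole-word alternative using a regular expression with word boundaries; the helper name `mentions_keyword` and the sample sentence are hypothetical.

```python
import re

def mentions_keyword(text, keyword):
    """Return True if `keyword` appears as a whole word in `text` (case-insensitive)."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

text = "My parent called 311 about the current shelter wait"

print("rent" in text.lower())             # substring match also hits "parent" and "current"
print(mentions_keyword(text, "rent"))     # whole-word match
print(mentions_keyword(text, "shelter"))  # whole-word match
```

Since pandas `str.contains` treats its pattern as a regular expression by default, the same `\b...\b` pattern could be passed to it directly.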

In [None]:
# Ensure DataFrame slices are explicit copies to avoid SettingWithCopyWarning
deblasio_df = reddit_short_df[(reddit_short_df['date'] >= start_date_deblasio) & (reddit_short_df['date'] <= end_date_deblasio)].copy()
adams_df = reddit_short_df[(reddit_short_df['date'] >= start_date_adams) & (reddit_short_df['date'] <= end_date_adams)].copy()

# Initialize the keywords list
keywords = ['housing', 'benefits', 'shelter', 'wait', 'delays', 'budget', 'immigration', 'migrant', 'covid', 'blasio', 'adams', 'wage', 'employment', 'rent']

from nltk.sentiment import SentimentIntensityAnalyzer

# Initialize the SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

# Create the get_sentiment function
def get_sentiment(text):
    # Get the sentiment scores for the text
    scores = analyzer.polarity_scores(text)

    # Return the compound sentiment score (this score reflects the overall sentiment)
    return scores['compound']

# Apply the get_sentiment function to both DataFrames
adams_df['parent_comment_sentiment'] = adams_df['cleaned_top_parent_comment'].apply(get_sentiment)
deblasio_df['parent_comment_sentiment'] = deblasio_df['cleaned_top_parent_comment'].apply(get_sentiment)

# Function to calculate average sentiment for each keyword for both periods
def calculate_average_sentiment(df, keywords, sentiment_column):
    average_sentiments = {}

    for keyword in keywords:
        # Filter rows where the keyword is present in 'combined_text'
        # using str.contains for case-insensitive search
        keyword_rows = df[df['combined_text'].str.contains(keyword, case=False, na=False)]

        # Calculate the average sentiment score for these rows
        avg_sentiment = keyword_rows[sentiment_column].mean()
        average_sentiments[keyword] = avg_sentiment

    return average_sentiments

# Calculate average sentiment for Adams period
adams_average_sentiments = calculate_average_sentiment(adams_df, keywords, 'parent_comment_sentiment')

# Calculate average sentiment for De Blasio period
deblasio_average_sentiments = calculate_average_sentiment(deblasio_df, keywords, 'parent_comment_sentiment')

# Print the average sentiment for Adams period
print("Average Parent Comment Sentiment for Adams Period:")
for keyword, sentiment in adams_average_sentiments.items():
    print(f"{keyword}: {sentiment}")

# Print a separator line
print("\n" + "-"*40 + "\n")

# Print the average sentiment for De Blasio period
print("Average Parent Comment Sentiment for De Blasio Period:")
for keyword, sentiment in deblasio_average_sentiments.items():
    print(f"{keyword}: {sentiment}")

#Using Melanie Walsh's Sentiment Analysis guide, I prompted ChatGPT to write code that includes the parent_comment sentiment for the Adams time period and de Blasio time period.

In [None]:
# Define the keywords list
keywords = ['housing', 'benefits', 'shelter', 'wait', 'delays', 'budget', 'immigration', 'migrant', 'covid', 'blasio', 'adams', 'wage', 'employment', 'rent']

# Create a DataFrame
# Convert dictionaries to lists for 'Adams Sentiment' and 'De Blasio Sentiment' columns
data = {
    'Keyword': keywords,
    'Adams Sentiment': [adams_average_sentiments.get(keyword, float('nan')) for keyword in keywords],
    'De Blasio Sentiment': [deblasio_average_sentiments.get(keyword, float('nan')) for keyword in keywords]
}

df = pd.DataFrame(data)

# Plotting heatmap
df.set_index('Keyword', inplace=True)
plt.figure(figsize=(10, 6))
sns.heatmap(df.T, annot=True, cmap="coolwarm", center=0, cbar_kws={'label': 'Sentiment'})
plt.title('Parent Comment Sentiment Heatmap: Adams vs De Blasio')
plt.show()

#I adapted the above script from Geeks for Geeks Seaborn Heatmap Guide: https://www.geeksforgeeks.org/seaborn-heatmap-a-comprehensive-guide/ and I prompted ChatGPT to modify the code with my keyword list

In [None]:
# Ensure DataFrame slices are explicit copies to avoid SettingWithCopyWarning
deblasio_df = reddit_short_df[(reddit_short_df['date'] >= start_date_deblasio) & (reddit_short_df['date'] <= end_date_deblasio)].copy()
adams_df = reddit_short_df[(reddit_short_df['date'] >= start_date_adams) & (reddit_short_df['date'] <= end_date_adams)].copy()

# Initialize the keywords list
keywords = ['housing', 'benefits', 'shelter', 'wait', 'delays', 'budget', 'immigration', 'migrant', 'covid', 'blasio', 'adams', 'wage', 'employment', 'rent']

from nltk.sentiment import SentimentIntensityAnalyzer

# Initialize the SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

# Create the get_sentiment function
def get_sentiment(text):
    # Get the sentiment scores for the text
    scores = analyzer.polarity_scores(text)

    # Return the compound sentiment score (this score reflects the overall sentiment)
    return scores['compound']

# Apply the get_sentiment function to both DataFrames, creating 'post_sentiment'
adams_df['post_sentiment'] = adams_df['combined_text'].apply(get_sentiment) #Creating post_sentiment column
adams_df['parent_comment_sentiment'] = adams_df['cleaned_top_parent_comment'].apply(get_sentiment)
deblasio_df['post_sentiment'] = deblasio_df['combined_text'].apply(get_sentiment) #Creating post_sentiment column
deblasio_df['parent_comment_sentiment'] = deblasio_df['cleaned_top_parent_comment'].apply(get_sentiment)

# Function to calculate average sentiment for each keyword across both parent_comment_sentiment and post_sentiment
def calculate_average_sentiment(df, keywords):
    average_sentiments = {}

    for keyword in keywords:
        # Filter rows where the keyword is present in 'combined_text'
        keyword_rows = df[df['combined_text'].str.contains(keyword, case=False, na=False)].copy()  # Explicit copy to avoid SettingWithCopyWarning

        # Calculate the average sentiment score for these rows across both columns (parent_comment_sentiment and post_sentiment)
        keyword_rows['average_sentiment'] = keyword_rows[['parent_comment_sentiment', 'post_sentiment']].mean(axis=1)

        # Calculate the average of the combined sentiment for this keyword
        avg_sentiment = keyword_rows['average_sentiment'].mean()
        average_sentiments[keyword] = avg_sentiment

    return average_sentiments

# Calculate average sentiment for Adams period
adams_average_sentiments = calculate_average_sentiment(adams_df, keywords)

# Calculate average sentiment for De Blasio period
deblasio_average_sentiments = calculate_average_sentiment(deblasio_df, keywords)

# Print the average sentiment for each keyword for both periods
print("Average Sentiment for Adams Period (combined):")
for keyword, avg_sentiment in adams_average_sentiments.items():
    print(f"{keyword}: {avg_sentiment}")

print("\nAverage Sentiment for De Blasio Period (combined):")
for keyword, avg_sentiment in deblasio_average_sentiments.items():
    print(f"{keyword}: {avg_sentiment}")

#Using Melanie Walsh's Sentiment Analysis guide, I prompted ChatGPT to write a code to average the sentiment for the Adams time period and de Blasio time period.
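
To help read the compound scores printed above, a small helper can bucket them into coarse labels. This is a minimal sketch and not part of the original analysis: the function name `label_compound` is hypothetical, the example scores are made up, and the 0.05 cut-off is the threshold conventionally suggested for VADER, assumed here rather than tuned to the Reddit data.

```python
import math


def label_compound(score, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label.

    `threshold` is an assumed cut-off (0.05 is the conventional VADER value).
    """
    if score is None or math.isnan(score):
        return "missing"  # keywords with no matching rows produce NaN averages
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Illustrative scores only; these are not results from the Reddit data
example_scores = {"keyword_a": 0.51, "keyword_b": -0.20, "keyword_c": 0.01}
labels = {keyword: label_compound(score) for keyword, score in example_scores.items()}
print(labels)
```

Applied to `adams_average_sentiments` or `deblasio_average_sentiments`, the same helper would show at a glance which keywords sit in the neutral band.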

In [None]:
# Define the keywords list
keywords = ['housing', 'benefits', 'shelter', 'wait', 'delays', 'budget', 'immigration', 'migrant', 'covid', 'blasio', 'adams', 'wage', 'employment', 'rent']

# Create a DataFrame
# Convert dictionaries to lists for 'Adams Sentiment' and 'De Blasio Sentiment' columns
data = {
    'Keyword': keywords,
    'Adams Sentiment': [adams_average_sentiments.get(keyword, float('nan')) for keyword in keywords],
    'De Blasio Sentiment': [deblasio_average_sentiments.get(keyword, float('nan')) for keyword in keywords]
}

df = pd.DataFrame(data)

# Plotting heatmap
df.set_index('Keyword', inplace=True)
plt.figure(figsize=(10, 6))
sns.heatmap(df.T, annot=True, cmap="coolwarm", center=0, cbar_kws={'label': 'Sentiment'})
plt.title('Sentiment Heatmap: Adams vs De Blasio')
plt.show()

#I adapted the above script from Geeks for Geeks Seaborn Heatmap Guide: https://www.geeksforgeeks.org/seaborn-heatmap-a-comprehensive-guide/ and I prompted ChatGPT to modify the code with my keyword list

Adams has a slightly higher sentiment in parent comments, but it is still the lowest of the keywords. Sentiment for benefits decreased significantly, by 0.4, to 0.51. This could be because posts are generally more question-oriented or neutral, whereas parent comments express a response to the post.

In [None]:
!pip install wordcloud matplotlib
from wordcloud import WordCloud

# Keywords list
keywords = ['housing', 'benefits', 'shelter', 'wait', 'delays', 'budget', 'immigration', 'migrant', 'covid', 'blasio', 'adams', 'wage', 'employment', 'rent']

# Function to calculate keyword frequencies for a given DataFrame
def calculate_keyword_frequencies(df, keywords):
    # Initialize an empty dictionary to store frequencies
    keyword_frequencies = {}

    # Loop through each keyword
    for keyword in keywords:
        # Count the posts whose 'combined_text' contains the keyword (case-insensitive substring match);
        # na=False treats missing text as "no match"
        keyword_frequencies[keyword] = df['combined_text'].str.contains(keyword, case=False, na=False).sum()

    return keyword_frequencies

# Calculate keyword frequencies for Adams period
adams_keyword_frequencies = calculate_keyword_frequencies(adams_df, keywords)

# Calculate keyword frequencies for De Blasio period
deblasio_keyword_frequencies = calculate_keyword_frequencies(deblasio_df, keywords)

# Create and display the word cloud for Adams period
wordcloud_adams = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(adams_keyword_frequencies)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_adams, interpolation='bilinear')
plt.axis('off')  # Turn off the axis
plt.title("Adams Period - Keyword Frequency Word Cloud")
plt.show()

# Create and display the word cloud for De Blasio period
wordcloud_deblasio = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(deblasio_keyword_frequencies)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_deblasio, interpolation='bilinear')
plt.axis('off')  # Turn off the axis
plt.title("De Blasio Period - Keyword Frequency Word Cloud")
plt.show()

#I adapted the above and below code from Geeks for Geeks Generating Word Cloud in Python Guide: https://www.geeksforgeeks.org/generating-word-cloud-python/ and asked ChatGPT to incorporate my key words and their respective frequency distribution

In [None]:
# Additional keywords with their frequencies for both periods
additional_keywords_adams = {
    'people': 519, 'new': 429, 'one': 393, 'time': 312, 'would': 274, 'go': 266,
    'back': 238, 'even': 215, 'see': 213, 'want': 210, 'day': 208, 'feel': 180,
    'good': 176, '311': 176, 'work': 175, 'love': 171, 'every': 170, 'way': 168,
    'years': 168, 'much': 167
}

additional_keywords_deblasio = {
    'people': 489, 'new': 421, 'one': 389, 'time': 318, 'would': 268, 'go': 257,
    'back': 230, 'even': 213, 'see': 210, 'want': 215, 'day': 206, 'feel': 173,
    'good': 172, '311': 160, 'work': 160, 'love': 159, 'every': 155, 'way': 150,
    'years': 162, 'much': 158
}

# Create and display the word cloud for Adams period using the additional keywords
wordcloud_adams = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(additional_keywords_adams)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_adams, interpolation='bilinear')
plt.axis('off')  # Turn off the axis
plt.title("Adams Period - Additional Keyword Frequency Word Cloud")
plt.show()

# Create and display the word cloud for De Blasio period using the additional keywords
wordcloud_deblasio = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(additional_keywords_deblasio)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_deblasio, interpolation='bilinear')
plt.axis('off')  # Turn off the axis
plt.title("De Blasio Period - Additional Keyword Frequency Word Cloud")
plt.show()
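
The `calculate_keyword_frequencies` function used for the first pair of word clouds counts how many posts contain each keyword, not how many times the keyword appears in total. The sketch below illustrates the difference under stated assumptions: `sample_texts` is a made-up list, and both helper names are hypothetical.

```python
def rows_containing(texts, keyword):
    """Number of texts that mention the keyword at least once (case-insensitive)."""
    keyword = keyword.lower()
    return sum(1 for text in texts if keyword in text.lower())


def total_occurrences(texts, keyword):
    """Total number of times the keyword appears across all texts (case-insensitive)."""
    keyword = keyword.lower()
    return sum(text.lower().count(keyword) for text in texts)


# Made-up texts for illustration; not taken from the Reddit data
sample_texts = [
    "Shelter intake was slow, and the shelter line was long",
    "No shelter beds tonight",
    "Called 311 about heat",
]

print(rows_containing(sample_texts, "shelter"))    # 2
print(total_occurrences(sample_texts, "shelter"))  # 3
```

Either measure can size a word cloud; the choice only changes whether a single long post that repeats a keyword counts once or several times.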


NYC Open Data 311 Analysis
1. df
2. 2020 - 2023
3. scrape 311 open data
4. clean 311 open data: lemmatize, tokenize, remove stop words
5. frequency analysis - broken up by de Blasio (Jan 1, 2020 - Dec 31, 2021) and Adams (Jan 1, 2022 - Dec 31, 2023)
6. Word Cloud, frequency distribution (matplotlib)

In [None]:
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

file_path = '/content/drive/MyDrive/Python F24/Final Project/311_Service_Requests (2).csv'

df = pd.read_csv(file_path)

print(df.head())

#I adapted this code from Rebecca Krisel's Data Manipulation in Pandas and Python Workshop: https://github.com/rskrisel/pandas/blob/main/pandas_workshop_2024.ipynb

In [None]:
df

In [None]:
df = df.drop(columns=['Location Type'])

In [None]:
print(df.columns)

In [None]:
df.info()

In [None]:
df['Unique Key'] = df['Unique Key'].astype(str)

print(df.dtypes)


In [None]:
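# Parse the dates before taking min/max; on raw text columns min/max compare alphabetically, not chronologically
created_dates = pd.to_datetime(df['Created Date'], format='%m/%d/%Y %I:%M:%S %p')
print(created_dates.min(), created_dates.max())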


In [None]:
df['Created Date'] = pd.to_datetime(df['Created Date'],
                                    format='%m/%d/%Y %I:%M:%S %p')

In [None]:
df[df.duplicated(keep=False)]

In [None]:
unique_agencies = df['Agency'].unique()

print(unique_agencies)


I decided to include the NYPD because calls about unhoused populations are often directed to the NYPD. The other agencies are directly related to benefits, wages, and social services.

In [None]:
agencies_to_include = ['DOHMH', 'NYPD', 'HPD', 'DCWP', 'DHS']

agencies_df = df[df['Agency'].isin(agencies_to_include)]

print(agencies_df)


I am not going to tokenize because the data is already tokenized. Below, I remove any descriptor that contains a number, because those entries provide numbers rather than words and are not useful for text analysis.

In [None]:
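# True only for string descriptors that contain at least one digit; missing descriptors are kept
def has_digit(value):
    return isinstance(value, str) and any(char.isdigit() for char in value)

# Explicit copy so later column assignments do not trigger SettingWithCopyWarning
df_filtered = agencies_df[~agencies_df['Descriptor'].apply(has_digit)].copy()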

print(df_filtered)

In [None]:
df_filtered

Now, I need to divide my data into two date ranges per administration and summarize the number of complaints.
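
One detail matters when splitting by date: 311 timestamps carry a time of day, so comparing them to an end date such as December 31 only includes requests up to midnight at the start of that day. A minimal sketch of a half-open range check, using only the standard library, is shown here. The names `ADMIN_RANGES` and `administration_for` are hypothetical, and the sample timestamp is made up.

```python
from datetime import datetime

# Half-open ranges [start, end): the end is the first moment that is NOT included
ADMIN_RANGES = {
    "de Blasio": (datetime(2020, 1, 1), datetime(2022, 1, 1)),
    "Adams": (datetime(2022, 1, 1), datetime(2024, 1, 1)),
}


def administration_for(created):
    """Return the administration whose range contains `created`, or None."""
    for name, (start, end) in ADMIN_RANGES.items():
        if start <= created < end:
            return name
    return None


# A made-up request filed late on the last day of the de Blasio range
late_request = datetime(2021, 12, 31, 23, 15)

print(administration_for(late_request))         # de Blasio
print(late_request <= datetime(2021, 12, 31))   # False: an inclusive end date would drop it
```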

In [None]:
date_range1_start = '2020-01-01'
date_range1_end = '2021-12-31'
date_range2_start = '2022-01-01'
date_range2_end = '2023-12-31'

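# Filter the DataFrame for each date range.
# 'Created Date' includes a time of day, so "<= end date" would stop at midnight and drop most of
# the final day; compare against the start of the following day instead.
range1_end_exclusive = pd.Timestamp(date_range1_end) + pd.Timedelta(days=1)
range2_end_exclusive = pd.Timestamp(date_range2_end) + pd.Timedelta(days=1)
df_range1 = df_filtered[(df_filtered['Created Date'] >= date_range1_start) & (df_filtered['Created Date'] < range1_end_exclusive)]
df_range2 = df_filtered[(df_filtered['Created Date'] >= date_range2_start) & (df_filtered['Created Date'] < range2_end_exclusive)]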

# Group by 'Agency' and count the complaints
complaints_range1 = df_range1['Agency'].value_counts().reset_index()
complaints_range1.columns = ['Agency', 'Complaints (2020-2021)']

complaints_range2 = df_range2['Agency'].value_counts().reset_index()
complaints_range2.columns = ['Agency', 'Complaints (2022-2023)']

# Merge the results into a single DataFrame
complaints_summary = pd.merge(
    complaints_range1,
    complaints_range2,
    on='Agency',
    how='outer'
).fillna(0)  # Fill NaN values with 0

# Convert counts to integers
complaints_summary[['Complaints (2020-2021)', 'Complaints (2022-2023)']] = complaints_summary[
    ['Complaints (2020-2021)', 'Complaints (2022-2023)']
].astype(int)

# Display the summary
print(complaints_summary)

#I prompted ChatGPT to group the complaints by agency across the two timeframes and print as a table.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Set positions and width for the bars
x = np.arange(len(complaints_summary['Agency']))  # label locations
width = 0.35  # width of bars

# Create the figure and axes
fig, ax = plt.subplots(figsize=(10, 6))

# Bar plots for each date range
bars1 = ax.bar(x - width/2, complaints_summary['Complaints (2020-2021)'], width, label='2020-2021')
bars2 = ax.bar(x + width/2, complaints_summary['Complaints (2022-2023)'], width, label='2022-2023')

# Add labels, title, and legend
ax.set_xlabel('Agency', fontsize=12)
ax.set_ylabel('Number of Complaints', fontsize=12)
ax.set_title('Number of Complaints by Agency (2020-2023)', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(complaints_summary['Agency'], rotation=45, ha='right')
ax.legend()

# Add value annotations on top of bars
for bar in bars1 + bars2:
    height = bar.get_height()
    ax.annotate(f'{height}',
                xy=(bar.get_x() + bar.get_width() / 2, height),
                xytext=(0, 3),  # Offset text above the bar
                textcoords="offset points",
                ha='center', va='bottom')

# Improve layout
plt.tight_layout()
plt.show()

#I adapted this code from Rebecca Krisel's Data Manipulation in Pandas and Python Workshop: https://github.com/rskrisel/pandas/blob/main/pandas_workshop_2024.ipynb and prompted ChatGPT to modify the code with the two time frames

In [None]:
!pip install nltk
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import string

# Ensure stopwords are downloaded
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab') # Download the punkt_tab model

# Define stop words and punctuation, adding the custom stop words to NLTK's English list
custom_stop_words = ['nan']
stop_words = set(stopwords.words('english')) | set(custom_stop_words)
punctuation = set(string.punctuation)

# Function to preprocess and get word frequencies
def get_top_words(df, column_name, top_n=20):
    # Combine all text in the specified column
    all_text = ' '.join(df[column_name].dropna().astype(str))

    # Tokenize, remove punctuation and stop words, and convert to lowercase
    tokens = word_tokenize(all_text)
    cleaned_tokens = [
        word.lower() for word in tokens
        if word.lower() not in stop_words and word not in punctuation and word.isalpha()
    ]

    # Get word frequencies
    freq_dist = FreqDist(cleaned_tokens)
    return freq_dist.most_common(top_n)

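# Make sure 'Created Date' is a datetime, then split the data by administration
df_filtered.loc[:, 'Created Date'] = pd.to_datetime(df_filtered['Created Date'])

# The end dates are whole days, so compare against the start of the next day to keep the full last day
range1_end_exclusive = pd.Timestamp(date_range1_end) + pd.Timedelta(days=1)
range2_end_exclusive = pd.Timestamp(date_range2_end) + pd.Timedelta(days=1)
df_range1 = df_filtered.loc[(df_filtered['Created Date'] >= date_range1_start) & (df_filtered['Created Date'] < range1_end_exclusive)]
df_range2 = df_filtered.loc[(df_filtered['Created Date'] >= date_range2_start) & (df_filtered['Created Date'] < range2_end_exclusive)]

# Get top 20 words for each period
top_words_range1 = get_top_words(df_range1, 'Descriptor', top_n=20)
top_words_range2 = get_top_words(df_range2, 'Descriptor', top_n=20)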

# Print results
print("Top 20 Words for de Blasio Administration in 311 Data:")
for word, count in top_words_range1:
    print(f"{word}: {count}")

print("\nTop 20 Words for Adams Administration in 311 Data:")
for word, count in top_words_range2:
    print(f"{word}: {count}")
#I adapted this script from Rebecca Krisel's Intro to NLTK Workshop: https://github.com/rskrisel/intro_to_nltk/blob/main/Intro_NLTK_workshop.ipynb and prompted ChatGPT to update the code with two timeframes, custom stop words, and top 20 words.

In [None]:
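# Build DataFrames from the top-word lists computed in the previous cell, then sort by frequency
range1_df = pd.DataFrame(top_words_range1, columns=['Word', 'Count']).sort_values(by='Count', ascending=False)
range2_df = pd.DataFrame(top_words_range2, columns=['Word', 'Count']).sort_values(by='Count', ascending=False)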

plt.figure(figsize=(10, 6))
plt.bar(range1_df['Word'], range1_df['Count'], color='skyblue')
plt.xlabel('Keywords', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Top Keywords in 311 Data During de Blasio Administration', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 6))
plt.bar(range2_df['Word'], range2_df['Count'], color='lightcoral')
plt.xlabel('Keywords', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Top Keywords in 311 Data During Adams Administration', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

#I adapted this script from Rebecca Krisel's Pandas Workshop: https://github.com/rskrisel/pandas/blob/main/pandas_workshop_2024.ipynb

In [None]:
key_words = ['housing', 'benefits', 'shelter', 'wait', 'delays', 'budget', 'immigration', 'migrant', 'covid', 'blasio', 'adams', 'wage', 'employment', 'rent']

def count_keywords(df, key_words, text_column):
    keyword_counts = {word: 0 for word in key_words}  # Initialize dictionary with 0 counts for each keyword

    # Iterate over each row in the DataFrame
    for text in df[text_column].dropna():  # Use dropna() to skip NaN values
        # Convert text to lowercase for case-insensitive matching
        text = text.lower()

        # Check each keyword and count its occurrences
        for word in key_words:
            keyword_counts[word] += text.count(word)  # Count occurrences of the word in the text

    return keyword_counts

keyword_counts_deblasio = count_keywords(df_range1, key_words, 'Descriptor')
keyword_counts_adams = count_keywords(df_range2, key_words, 'Descriptor')

print("Keyword Counts for de Blasio Administration:")
for word, count in keyword_counts_deblasio.items():
    print(f"{word}: {count}")

print("\nKeyword Counts for Adams Administration:")
for word, count in keyword_counts_adams.items():
    print(f"{word}: {count}")

#I adapted this script from Rebecca Krisel's Intro to NLTK Workshop: https://github.com/rskrisel/intro_to_nltk/blob/main/Intro_NLTK_workshop.ipynb and prompted ChatGPT to update the code with two timeframes and my keyword list
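
One caveat with `text.count(word)` in the cell above: it matches substrings, so a short keyword such as `rent` is also counted inside words like "parent" or "current". The sketch below shows a whole-word alternative with the standard `re` module. The helper name `count_whole_word` and the sample texts are hypothetical, and this is one possible approach rather than the method used in this notebook.

```python
import re


def count_whole_word(texts, keyword):
    """Count case-insensitive whole-word occurrences of `keyword` across `texts`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in texts)


# Made-up texts for illustration; not taken from the 311 data
sample = ["Rent increase notice", "Parent complained about current rent"]

substring_total = sum(text.lower().count("rent") for text in sample)
print(substring_total)                   # 4 (also counts "parent" and "current")
print(count_whole_word(sample, "rent"))  # 2
```

The trade-off is that whole-word matching no longer catches plurals such as "rents" or "migrants", so the keyword list would need those forms added explicitly.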

In [None]:
sorted_deblasio_counts = sorted(keyword_counts_deblasio.items(), key=lambda x: x[1], reverse=True)
sorted_adams_counts = sorted(keyword_counts_adams.items(), key=lambda x: x[1], reverse=True)

deblasio_keywords, deblasio_counts = zip(*sorted_deblasio_counts)
adams_keywords, adams_counts = zip(*sorted_adams_counts)

plt.figure(figsize=(10, 6))
plt.bar(deblasio_keywords, deblasio_counts, color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.title("Keyword Counts for de Blasio Administration", fontsize=14)
plt.xlabel("Keywords", fontsize=12)
plt.ylabel("Counts", fontsize=12)
plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 6))
plt.bar(adams_keywords, adams_counts, color='lightcoral')
plt.xticks(rotation=45, ha='right')
plt.title("Keyword Counts for Adams Administration", fontsize=14)
plt.xlabel("Keywords", fontsize=12)
plt.ylabel("Counts", fontsize=12)
plt.tight_layout()
plt.show()

#I adapted this script from Rebecca Krisel's Pandas Workshop: https://github.com/rskrisel/pandas/blob/main/pandas_workshop_2024.ipynb

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Convert 'Created Date' to datetime format if it's not already
df_filtered['Created Date'] = pd.to_datetime(df_filtered['Created Date'])

# Group by month, counting the number of complaints
monthly_trend = df_filtered.groupby(df_filtered['Created Date'].dt.to_period('M')).size()

# Set up the figure size for a larger plot
plt.figure(figsize=(12, 6))  # Increase the width and height for a larger graph

# Plotting the monthly trend
monthly_trend.plot(kind='line', color='b', linewidth=2)  # Use smooth lines with consistent color (blue)

# Adding title and labels
plt.title('Monthly Trend of Open Data 311 Complaints', fontsize=16)  # Title with larger font size
plt.xlabel('Month', fontsize=14)  # Label for the x-axis with larger font size
plt.ylabel('Number of Complaints', fontsize=14)  # Label for the y-axis with larger font size

# Adjust layout for better readability
plt.tight_layout()
plt.show()

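# Count complaints per calendar year for the annual bar chart
complaints_per_year = (
    df_filtered.groupby(df_filtered['Created Date'].dt.year)
    .size()
    .reset_index(name='Complaints')
    .rename(columns={'Created Date': 'Year'})
)

# Plotting complaints per year (years are cast to text so the x-axis shows whole years)
plt.figure(figsize=(10, 6))
plt.bar(complaints_per_year['Year'].astype(str), complaints_per_year['Complaints'], color='green')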
plt.title('311 Service Complaints Per Year')
plt.xlabel('Year')
plt.ylabel('Number of Complaints')
plt.xticks(rotation=45)
plt.tight_layout()  # Adjust layout for readability
plt.show()

#I adapted this script from Rebecca Krisel's Pandas Workshop: https://github.com/rskrisel/pandas/blob/main/pandas_workshop_2024.ipynb and prompted ChatGPT to adjust the layout for readability and to include the monthly trend as a line graph and the annual trend as a bar graph.
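
For readers less familiar with pandas grouping, the monthly and yearly counts plotted above amount to tallying requests by (year, month) and by year. The sketch below reproduces that idea with the standard library only. The timestamps in `sample_created` are made up for illustration and are not 311 records.

```python
from collections import Counter
from datetime import datetime

# Made-up request timestamps for illustration
sample_created = [
    datetime(2020, 3, 14, 9, 30),
    datetime(2020, 3, 29, 18, 5),
    datetime(2021, 7, 4, 12, 0),
    datetime(2022, 1, 1, 0, 10),
]

# Tally requests per (year, month) and per year
monthly_counts = Counter((created.year, created.month) for created in sample_created)
yearly_counts = Counter(created.year for created in sample_created)

print(sorted(monthly_counts.items()))  # [((2020, 3), 2), ((2021, 7), 1), ((2022, 1), 1)]
print(sorted(yearly_counts.items()))   # [(2020, 2), (2021, 1), (2022, 1)]
```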