# **CSE 5095: Social Media Mining and Analysis**
Fall 2024, Assignment #1, 200 points


In this assignment, we will explore the statistical properties of the quantitative features associated with each subreddit in your data set. Each data set has observations from two subreddits. In some data sets each observation is a post; in the others, each observation is a compilation of the comments on a unique post.


In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import ttest_ind, mannwhitneyu
import seaborn as sns
from scipy.stats import skew

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

### **Task 1: Descriptive Statistics (50 points)**
Build a table with the descriptive statistics (mean and variance) of the quantitative features for each
subreddit in your data set. For projects based on post-level data, these will include the post-level statistics
shown in Table 1. For projects based on comment data, these include the average of the average comment-level
statistics shown in Table 2 (average comment and user statistics are computed for each post). For
comment data, also include the user-level features listed in Table 2.

In [4]:
def calc_stats(group):
    return pd.DataFrame({
        'mean': group[features].mean(),
        'variance': group[features].var()
    })


df = pd.read_csv('project10.csv')

post_features = ['post_score', 'post_upvote_ratio', 'post_thumbs_ups', 'post_total_awards_received']
comment_features = ['score', 'controversiality', 'ups', 'downs']
user_features = ['user_awardee_karma', 'user_awarder_karma', 'user_link_karma', 'user_comment_karma', 'user_total_karma']

is_post_level = all(feature in df.columns for feature in post_features)
if is_post_level:
    features = post_features + user_features
else:
    features = comment_features + user_features

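# Per-subreddit mean and variance for every feature.
# The result is named `summary` so it does not shadow the `scipy.stats` module imported as `stats`.
summary = df.groupby('subreddit').apply(calc_stats).reset_index()
summary = summary.pivot(index='level_1', columns='subreddit', values=['mean', 'variance'])

# Reorder the columns so each subreddit's mean and variance sit side by side
subreddits = summary.columns.get_level_values(1).unique()
new_order = [(stat, subreddit) for subreddit in subreddits for stat in ['mean', 'variance']]
summary = summary.reindex(columns=new_order)

summary = summary.round(2)
summary.index.name = 'Feature'
summary.columns.names = ['Statistic', 'Subreddit']

summary.to_csv('subreddit_statistics.csv')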
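
The same kind of table can also be built more compactly with `groupby(...).agg`. The sketch below is an illustration only: it uses a small made-up DataFrame (`toy`, with two feature columns) rather than `project10.csv`, and assumes the same `subreddit` grouping column.

```python
import pandas as pd

# Toy data standing in for project10.csv (values are made up for illustration)
toy = pd.DataFrame({
    'subreddit': ['science', 'science', 'action', 'action'],
    'post_score': [10, 30, 4, 8],
    'post_upvote_ratio': [0.9, 0.7, 0.8, 0.6],
})
toy_features = ['post_score', 'post_upvote_ratio']

# Mean and sample variance (ddof=1, the pandas default) per subreddit.
# Columns of the result are (feature, statistic) pairs.
per_subreddit = toy.groupby('subreddit')[toy_features].agg(['mean', 'var'])

# Transpose, then move the statistic level to the columns:
# rows become features, columns become (subreddit, statistic) pairs.
compact = per_subreddit.T.unstack(level=1).round(2)
print(compact)
```

Note that `var` here is the sample variance, which matches what `group[features].var()` computes in `calc_stats`.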



### **Task 2: Distributions (50 points)**

For each quantitative feature, plot the two distributions corresponding to the two subreddits. Comment on
the properties of each distribution (symmetrical, left-skewed, right-skewed), and how they compare with
each other. 


In [5]:
# Flag a distribution as strongly skewed (|skewness| > 1), using `skew` from scipy.stats
def is_skewed(data):
    return abs(skew(data)) > 1

# Function to determine skewness
def get_skewness(data):
    skewness = skew(data)
    if abs(skewness) < 0.5:
        return "approximately symmetrical"
    elif skewness < 0:
        return "left-skewed"
    else:
        return "right-skewed"

# Function to plot distributions and print comments
def plot_and_comment(feature):
    plt.figure(figsize=(12, 6))
    for subreddit in df['subreddit'].unique():
        data = df[df['subreddit'] == subreddit][feature]
        sns.histplot(data, kde=True, label=subreddit)
        
        # Calculate and print statistics
        mean = data.mean()
        median = data.median()
        skewness = get_skewness(data)
        print(f"\n{subreddit} - {feature}:")
        print(f"  Mean: {mean:.2f}")
        print(f"  Median: {median:.2f}")
        print(f"  Distribution: {skewness}")
        print(f"  Range: {data.min():.2f} to {data.max():.2f}")
        
    plt.title(f'Distribution of {feature} by Subreddit')
    plt.xlabel(feature)
    plt.ylabel('Count')  # histplot's default stat is 'count', not density
    plt.legend()
    plt.savefig(f'{feature}_distribution.png')
    plt.close()

# Plot distributions for each feature
for feature in features:
    plot_and_comment(feature)


science - post_score:
  Mean: 180.31
  Median: 60.00
  Distribution: right-skewed
  Range: 0.00 to 4951.00

action - post_score:
  Mean: 94.45
  Median: 41.00
  Distribution: right-skewed
  Range: 0.00 to 880.00

science - post_upvote_ratio:
  Mean: 0.93
  Median: 0.96
  Distribution: left-skewed
  Range: 0.04 to 1.00

action - post_upvote_ratio:
  Mean: 0.93
  Median: 0.97
  Distribution: left-skewed
  Range: 0.12 to 1.00

science - post_thumbs_ups:
  Mean: 180.31
  Median: 60.00
  Distribution: right-skewed
  Range: 0.00 to 4951.00

action - post_thumbs_ups:
  Mean: 94.45
  Median: 41.00
  Distribution: right-skewed
  Range: 0.00 to 880.00

science - post_total_awards_received:
  Mean: 0.00
  Median: 0.00
  Distribution: right-skewed
  Range: 0.00 to 0.00

action - post_total_awards_received:
  Mean: 0.00
  Median: 0.00
  Distribution: right-skewed
  Range: 0.00 to 0.00

science - user_awardee_karma:
  Mean: 1944.27
  Median: 127.00
  Distribution: right-skewed
  Range: 0.00 to 1544
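
Both subreddits report `post_total_awards_received` as right-skewed even though its range is 0.00 to 0.00. A likely cause is that `skew` returns NaN for this feature (for example because the values are constant or contain missing entries). Every comparison with NaN is false, so `get_skewness` falls through to its final `else` branch. A minimal sketch of a guard follows; the function name `describe_skewness` and the label strings for the degenerate cases are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

def describe_skewness(data, threshold=0.5):
    """Label a distribution by its sample skewness, handling degenerate input."""
    values = pd.Series(data).dropna()
    # Skewness is undefined when there are fewer than two distinct values
    if values.nunique() < 2:
        return "constant or empty (skewness undefined)"
    skewness = skew(values)
    # Defensive check in case the computation still yields NaN
    if np.isnan(skewness):
        return "skewness undefined"
    if abs(skewness) < threshold:
        return "approximately symmetrical"
    return "left-skewed" if skewness < 0 else "right-skewed"

print(describe_skewness([0, 0, 0, 0]))       # constant feature
print(describe_skewness([1, 2, 3, 4, 5]))    # symmetric values
print(describe_skewness([1, 1, 1, 2, 50]))   # long right tail
```

Checking the number of distinct values first keeps the result independent of how a particular SciPy version treats constant input.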

### **Task 3: Statistical Significance (25 points)**
For each quantitative feature from Task 1, assess whether the difference between the two
subreddits is statistically significant (at the 5% level). Refer to the distribution of each feature in Task 2 to determine which statistical test would be
the most appropriate. For example, if the data follow a near-symmetric distribution, then the t-test might be
the most appropriate. On the other hand, if the data follow a highly skewed distribution, then a nonparametric test will be appropriate.

In [8]:
# Function to determine skewness
def is_skewed(data):
    return abs(data.skew()) > 1

# Function to perform statistical test
def perform_test(feature):
    subreddit1, subreddit2 = df['subreddit'].unique()
    data1 = df[df['subreddit'] == subreddit1][feature]
    data2 = df[df['subreddit'] == subreddit2][feature]
    
    # Check for skewness
    if is_skewed(data1) or is_skewed(data2):
        # If skewed, use Mann-Whitney U test
        statistic, p_value = mannwhitneyu(data1, data2)
        test_name = "Mann-Whitney U test"
    else:
        # If not skewed, use t-test
        statistic, p_value = ttest_ind(data1, data2)
        test_name = "T-test"
    
    return test_name, statistic, p_value

# Perform tests for each feature
results = []
for feature in features:
    test_name, statistic, p_value = perform_test(feature)
    significant = "Yes" if p_value < 0.05 else "No"
    results.append({
        'Feature': feature,
        'Test': test_name,
        'Statistic': statistic,
        'P-value': p_value,
        'Significant at 5% level': significant
    })

# Create and display results DataFrame
results_df = pd.DataFrame(results)
results_df = results_df.set_index('Feature')
print(results_df.to_string())

# Save results to CSV
results_df.to_csv('statistical_test_results.csv')
print("\nResults have been saved to 'statistical_test_results.csv'")

                                           Test  Statistic       P-value Significant at 5% level
Feature                                                                                         
post_score                  Mann-Whitney U test  1366425.0  3.433923e-16                     Yes
post_upvote_ratio           Mann-Whitney U test  1068302.0  2.153275e-04                     Yes
post_thumbs_ups             Mann-Whitney U test  1366425.0  3.433923e-16                     Yes
post_total_awards_received               T-test        NaN           NaN                      No
user_awardee_karma          Mann-Whitney U test  1248807.5  4.322622e-04                     Yes
user_awarder_karma          Mann-Whitney U test  1220697.0  8.009958e-03                     Yes
user_link_karma             Mann-Whitney U test  1201743.0  1.060519e-01                      No
user_comment_karma          Mann-Whitney U test  1454788.0  1.859565e-31                     Yes
user_total_karma            Ma
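
The `post_total_awards_received` row reports NaN for both the statistic and the p-value. Task 2 shows a range of 0.00 to 0.00 for this feature in both subreddits, so both samples have zero variance and the t statistic is undefined. The sketch below shows one way to detect that case before testing. The function name `compare_groups`, its `skew_threshold` parameter, and the "not testable" label are hypothetical; the test-selection rule mirrors `perform_test`.

```python
import pandas as pd
from scipy.stats import mannwhitneyu, ttest_ind

def compare_groups(data1, data2, skew_threshold=1.0):
    """Return (test_name, statistic, p_value), skipping undefined tests."""
    a = pd.Series(data1).dropna()
    b = pd.Series(data2).dropna()
    # If every pooled value is identical, there is no difference to test
    if pd.concat([a, b]).nunique() < 2:
        return "not testable (constant feature)", float('nan'), float('nan')
    # Same rule as perform_test: strong skew in either group -> nonparametric test
    if abs(a.skew()) > skew_threshold or abs(b.skew()) > skew_threshold:
        statistic, p_value = mannwhitneyu(a, b, alternative='two-sided')
        return "Mann-Whitney U test", statistic, p_value
    statistic, p_value = ttest_ind(a, b)
    return "T-test", statistic, p_value

print(compare_groups([0, 0, 0], [0, 0, 0])[0])
print(compare_groups([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])[0])
print(compare_groups([1, 1, 1, 2, 50], [1, 2, 3, 4, 5])[0])
```

Reporting such a feature as "not testable" rather than "No" keeps an undefined result from being read as evidence of no difference.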

In [9]:
significant_features = results_df[results_df['Significant at 5% level'] == 'Yes'].index

for feature in significant_features:
    plt.figure(figsize=(10, 6))
    sns.histplot(data=df, x=feature, hue='subreddit', kde=True, element="step")
    plt.title(f'Distribution of {feature} by Subreddit')
    plt.savefig(f'{feature}_distribution.png')
    plt.close()

In [10]:
import pandas as pd
import numpy as np
from scipy import stats
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
df = pd.read_csv('project10.csv')

# Convert post_created_time to datetime
df['post_created_time'] = pd.to_datetime(df['post_created_time'])

# Feature Engineering
def compute_aggregated_time_features(df):
    features = {}
    
    # 1. Hour of day distribution
    hour_dist = df['post_created_time'].dt.hour.value_counts(normalize=True).sort_index()
    features['peak_hour'] = hour_dist.idxmax()
    features['hour_entropy'] = stats.entropy(hour_dist)
    
    # 2. Day of week distribution
    dow_dist = df['post_created_time'].dt.dayofweek.value_counts(normalize=True).sort_index()
    features['peak_day'] = dow_dist.idxmax()
    features['day_entropy'] = stats.entropy(dow_dist)
    
    # 3. Month distribution
    month_dist = df['post_created_time'].dt.month.value_counts(normalize=True).sort_index()
    features['peak_month'] = month_dist.idxmax()
    features['month_entropy'] = stats.entropy(month_dist)
    
    # 4. Posting regularity
    time_diffs = df['post_created_time'].sort_values().diff().dt.total_seconds() / 3600  # in hours
    features['mean_time_between_posts'] = time_diffs.mean()
    features['std_time_between_posts'] = time_diffs.std()
    
    # 5. Temporal density
    time_range = (df['post_created_time'].max() - df['post_created_time'].min()).total_seconds() / 3600 / 24  # in days
    features['posts_per_day'] = len(df) / time_range if time_range > 0 else 0
    
    # 6. Burstiness
    if len(time_diffs) > 1:
        features['burstiness'] = (time_diffs.std() - time_diffs.mean()) / (time_diffs.std() + time_diffs.mean())
    else:
        features['burstiness'] = 0
    
    return pd.Series(features)

# Compute features for each subreddit
aggregated_features = df.groupby('subreddit').apply(compute_aggregated_time_features).reset_index()

# Function to test statistical significance
def test_significance(feature):
    data1 = aggregated_features.loc[aggregated_features['subreddit'] == aggregated_features['subreddit'].iloc[0], feature]
    data2 = aggregated_features.loc[aggregated_features['subreddit'] == aggregated_features['subreddit'].iloc[1], feature]
    
    # Since we have only one value per subreddit, we can't perform statistical tests
    # Instead, we'll calculate the absolute difference and relative difference
    abs_diff = abs(data1.iloc[0] - data2.iloc[0])
    rel_diff = abs_diff / ((data1.iloc[0] + data2.iloc[0]) / 2) if (data1.iloc[0] + data2.iloc[0]) != 0 else 0
    
    return abs_diff, rel_diff

# Test significance for each new feature
results = []

for feature in aggregated_features.columns[2:]:  # Positional slice: skips 'subreddit' and the first feature column, 'peak_hour'
    abs_diff, rel_diff = test_significance(feature)
    results.append({
        'Feature': feature,
        'Absolute Difference': abs_diff,
        'Relative Difference': rel_diff
    })

# Create and display results DataFrame
results_df = pd.DataFrame(results)
results_df = results_df.set_index('Feature')
print(results_df.to_string())

# Save results to CSV
results_df.to_csv('time_feature_differences.csv')
print("\nResults have been saved to 'time_feature_differences.csv'")

# Visualize features
plt.figure(figsize=(12, 8))
sns.heatmap(aggregated_features.set_index('subreddit').iloc[:, 1:], annot=True, cmap='coolwarm', center=0)
plt.title('Aggregated Time Features by Subreddit')
plt.tight_layout()
plt.savefig('time_features_heatmap.png')
plt.close()

print("\nHeatmap of aggregated time features has been saved as 'time_features_heatmap.png'.")

                         Absolute Difference  Relative Difference
Feature                                                          
hour_entropy                        0.096639             0.032373
peak_day                            5.000000             1.428571
day_entropy                         0.003618             0.001885
peak_month                          0.000000             0.000000
month_entropy                       0.502769             0.227961
mean_time_between_posts            19.393762             1.632588
std_time_between_posts             24.078345             1.591799
posts_per_day                       9.889222             1.632416
burstiness                          0.057072             0.398505

Results have been saved to 'time_feature_differences.csv'

Heatmap of aggregated time features has been saved as 'time_features_heatmap.png'.
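
The difference table above has no `peak_hour` row. When the function passed to `groupby(...).apply(...)` returns a Series, the result has one row per subreddit and one column per feature, so after `reset_index()` the only non-feature column is `subreddit`. Slicing with `columns[2:]` therefore drops the first feature as well. A minimal sketch of selecting the feature columns by name instead is given below; `toy_agg` is a made-up stand-in for `aggregated_features`, and its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for `aggregated_features` (values are made up for illustration)
toy_agg = pd.DataFrame({
    'subreddit': ['science', 'action'],
    'peak_hour': [14, 18],
    'hour_entropy': [3.0, 2.9],
})

# Select feature columns by name rather than by position
feature_cols = toy_agg.columns.drop('subreddit')

# One row per subreddit, one column per feature
by_subreddit = toy_agg.set_index('subreddit')[feature_cols]

# Absolute difference between the two subreddits for every feature
abs_diff = (by_subreddit.iloc[0] - by_subreddit.iloc[1]).abs()
print(abs_diff)
```

Selecting by name keeps the loop correct even if the column order or the number of index columns changes.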




