# City Pass Sentiment Analysis & Invitation Success Prediction

## Exploratory Data Analysis

This notebook explores the user profile and message data to identify patterns and relationships that can inform a Graph Neural Network (GNN) model for predicting invitation success.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from textblob import TextBlob
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
from collections import Counter
import re
import sys
import os

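# Configure plot style. The 'seaborn-whitegrid' style name is no longer
# accepted by recent matplotlib releases, so the seaborn theme is set directly.
sns.set_theme(style='whitegrid')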
plt.rcParams['figure.figsize'] = (12, 8)

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

# Add path to src directory
sys.path.append(os.path.abspath('../src'))

# Download necessary NLTK resources
# (recent NLTK releases may additionally need the 'punkt_tab' resource for word_tokenize)
try:
    nltk.data.find('tokenizers/punkt')
    nltk.data.find('corpora/stopwords')
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('wordnet')

## Loading and Initial Exploration of Data

In [None]:
# Load data
user_data_path = '../data/user_data.csv'
message_data_path = '../data/message_data.csv'

user_data = pd.read_csv(user_data_path)
message_data = pd.read_csv(message_data_path)

print(f"User data shape: {user_data.shape}")
print(f"Message data shape: {message_data.shape}")

In [None]:
# Examine user data
print("User data info:")
user_data.info()
print("\nUser data sample:")
user_data.head()

In [None]:
# Examine message data
print("Message data info:")
message_data.info()
print("\nMessage data sample:")
message_data.head()

## Analysis of User Profiles

In [None]:
# Check for missing values in user data
print("Missing values in user data:")
print(user_data.isnull().sum())

# Fill missing profile text with an empty string (other columns are left untouched)
user_data['user_profile'] = user_data['user_profile'].fillna('').astype(str)

In [None]:
# Text length analysis
user_data['profile_length'] = user_data['user_profile'].apply(len)
user_data['profile_word_count'] = user_data['user_profile'].apply(lambda x: len(x.split()))

# Plot distribution of profile lengths
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
sns.histplot(user_data['profile_length'], bins=30, kde=True, ax=ax1)
ax1.set_title('Distribution of User Profile Character Length')
ax1.set_xlabel('Character Count')

sns.histplot(user_data['profile_word_count'], bins=30, kde=True, ax=ax2)
ax2.set_title('Distribution of User Profile Word Count')
ax2.set_xlabel('Word Count')

plt.tight_layout()
plt.show()

In [None]:
# Function to calculate sentiment using TextBlob
def get_sentiment(text):
    if not text or pd.isna(text) or text.strip() == "":
        return 0.0
    return TextBlob(text).sentiment.polarity

# Calculate sentiment for user profiles
user_data['sentiment_score'] = user_data['user_profile'].apply(get_sentiment)

# Plot sentiment distribution
plt.figure(figsize=(10, 6))
sns.histplot(user_data['sentiment_score'], bins=30, kde=True)
plt.title('Distribution of User Profile Sentiment Scores')
plt.xlabel('Sentiment Score')
plt.axvline(x=0, color='r', linestyle='--', alpha=0.5)
plt.show()

In [None]:
# Preprocess text for word frequency analysis
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stopwords and lemmatize
    cleaned_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    
    return cleaned_tokens

# Combine all user profiles
all_profiles = ' '.join(user_data['user_profile'].fillna(''))
processed_words = preprocess_text(all_profiles)

# Count word frequencies
word_freq = Counter(processed_words)
common_words = word_freq.most_common(25)

# Plot most common words
plt.figure(figsize=(12, 8))
words, counts = zip(*common_words)
sns.barplot(x=list(counts), y=list(words))
plt.title('25 Most Common Words in User Profiles')
plt.xlabel('Frequency')
plt.tight_layout()
plt.show()

In [None]:
# Generate a word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white', max_words=200, 
                      contour_width=3, contour_color='steelblue')
wordcloud.generate(' '.join(processed_words))

plt.figure(figsize=(16, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of User Profiles', fontsize=20)
plt.tight_layout()
plt.show()

## Analysis of Messages

In [None]:
# Check for missing values in message data
print("Missing values in message data:")
print(message_data.isnull().sum())

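# Fill missing message text only. Filling every column would turn a missing
# 'success' label into '' and break the numeric aggregations further down.
message_data['message'] = message_data['message'].fillna('').astype(str)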

In [None]:
# Text length analysis for messages
message_data['message_length'] = message_data['message'].apply(len)
message_data['message_word_count'] = message_data['message'].apply(lambda x: len(str(x).split()))

# Plot distribution of message lengths
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
sns.histplot(message_data['message_length'], bins=30, kde=True, ax=ax1)
ax1.set_title('Distribution of Message Character Length')
ax1.set_xlabel('Character Count')

sns.histplot(message_data['message_word_count'], bins=30, kde=True, ax=ax2)
ax2.set_title('Distribution of Message Word Count')
ax2.set_xlabel('Word Count')

plt.tight_layout()
plt.show()

In [None]:
# Calculate sentiment for messages
message_data['sentiment_score'] = message_data['message'].apply(get_sentiment)

# Plot sentiment distribution
plt.figure(figsize=(10, 6))
sns.histplot(message_data['sentiment_score'], bins=30, kde=True)
plt.title('Distribution of Message Sentiment Scores')
plt.xlabel('Sentiment Score')
plt.axvline(x=0, color='r', linestyle='--', alpha=0.5)
plt.show()

In [None]:
# Compare sentiment scores between successful and unsuccessful invitations
plt.figure(figsize=(10, 6))
sns.boxplot(x='success', y='sentiment_score', data=message_data)
plt.title('Message Sentiment by Invitation Success')
plt.xlabel('Success (0=Failed, 1=Successful)')
plt.ylabel('Sentiment Score')
plt.show()
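
The same outcome comparison can be applied to the length features computed earlier. The sketch below uses a small made-up frame in place of `message_data` (the values are illustrative, not results) and summarizes `message_word_count` per outcome. `length_by_success` is a hypothetical helper name, not something defined elsewhere in this project.

```python
import pandas as pd

def length_by_success(df, length_col="message_word_count", label_col="success"):
    """Summarize a length column separately for each outcome label."""
    return df.groupby(label_col)[length_col].agg(["count", "mean", "median"])

# Toy stand-in for message_data; the numbers are invented for illustration.
toy_messages = pd.DataFrame({
    "message_word_count": [4, 6, 12, 15, 9, 20],
    "success":            [0, 0, 1, 1, 0, 1],
})

summary = length_by_success(toy_messages)
print(summary)
```

On the real data, `length_by_success(message_data)` would give one row per outcome, which is enough to see whether length differs at all before investing in a formal test.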

In [None]:
# Word frequency analysis for successful vs unsuccessful messages
successful_messages = ' '.join(message_data[message_data['success'] == 1]['message'].fillna(''))
unsuccessful_messages = ' '.join(message_data[message_data['success'] == 0]['message'].fillna(''))

successful_words = preprocess_text(successful_messages)
unsuccessful_words = preprocess_text(unsuccessful_messages)

# Count word frequencies
successful_word_freq = Counter(successful_words)
unsuccessful_word_freq = Counter(unsuccessful_words)

common_successful = successful_word_freq.most_common(15)
common_unsuccessful = unsuccessful_word_freq.most_common(15)

# Plot most common words in successful messages
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 8))

words_success, counts_success = zip(*common_successful)
sns.barplot(x=list(counts_success), y=list(words_success), color='green', ax=ax1)
ax1.set_title('Most Common Words in Successful Invitations')
ax1.set_xlabel('Frequency')

words_fail, counts_fail = zip(*common_unsuccessful)
sns.barplot(x=list(counts_fail), y=list(words_fail), color='red', ax=ax2)
ax2.set_title('Most Common Words in Unsuccessful Invitations')
ax2.set_xlabel('Frequency')

plt.tight_layout()
plt.show()
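
Raw top-15 lists for the two groups tend to share many frequent words, which makes the bar charts hard to compare. One possible complement is to rank words by a smoothed log-ratio of their relative frequencies. The sketch below assumes two `Counter` objects like `successful_word_freq` and `unsuccessful_word_freq`; `distinctive_words`, `top_n`, and `alpha` are hypothetical names, and the toy counts are invented.

```python
import math
from collections import Counter

def distinctive_words(freq_a, freq_b, top_n=10, alpha=1.0):
    """Rank words by the smoothed log-ratio of relative frequency in A versus B."""
    vocab = set(freq_a) | set(freq_b)
    # Additive smoothing keeps words that are absent from one group finite.
    total_a = sum(freq_a.values()) + alpha * len(vocab)
    total_b = sum(freq_b.values()) + alpha * len(vocab)
    scores = {
        word: math.log((freq_a.get(word, 0) + alpha) / total_a)
              - math.log((freq_b.get(word, 0) + alpha) / total_b)
        for word in vocab
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

# Toy counters standing in for successful_word_freq / unsuccessful_word_freq.
toy_success = Counter({"trip": 10, "join": 8, "museum": 6})
toy_failure = Counter({"trip": 9, "sorry": 7, "busy": 5})

top_success = distinctive_words(toy_success, toy_failure, top_n=3)
print(top_success)
```

Swapping the two arguments ranks the words most characteristic of unsuccessful invitations instead.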

## Analysis of User-Message Relationships

In [None]:
# Success rate analysis
overall_success_rate = message_data['success'].mean()
print(f"Overall invitation success rate: {overall_success_rate:.2%}")

# Success rate by sender
sender_success = message_data.groupby('sid')['success'].agg(['count', 'mean'])
sender_success.columns = ['num_invitations', 'success_rate']
sender_success = sender_success.sort_values('num_invitations', ascending=False)

# Plot success rate by top senders
top_senders = sender_success.head(20).reset_index()
plt.figure(figsize=(12, 6))
plt.scatter(top_senders['num_invitations'], top_senders['success_rate'], alpha=0.7)
plt.axhline(y=overall_success_rate, color='r', linestyle='--', alpha=0.5, label=f'Overall: {overall_success_rate:.2%}')
plt.title('Invitation Success Rate by Top 20 Senders')
plt.xlabel('Number of Invitations Sent')
plt.ylabel('Success Rate')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
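
A success rate based on a handful of invitations is much noisier than one based on hundreds, so comparing raw rates across senders can mislead. As a rough guard, a Wilson score interval can be attached to each rate. The sketch below is a minimal standalone version; `wilson_interval` is a hypothetical helper, and `z=1.96` assumes an approximately 95% interval.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 1.0)  # no data: the rate is unconstrained
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return (center - half_width, center + half_width)

few = wilson_interval(2, 2)      # 100% observed rate from only 2 invitations
many = wilson_interval(80, 100)  # 80% observed rate from 100 invitations
print(few, many)
```

The interval for the sender with two invitations is far wider than the one for the sender with a hundred, even though the first has the higher observed rate.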

In [None]:
# Success rate by receiver
receiver_success = message_data.groupby('rid')['success'].agg(['count', 'mean'])
receiver_success.columns = ['num_received', 'success_rate']
receiver_success = receiver_success.sort_values('num_received', ascending=False)

# Plot success rate by top receivers
top_receivers = receiver_success.head(20).reset_index()
plt.figure(figsize=(12, 6))
plt.scatter(top_receivers['num_received'], top_receivers['success_rate'], alpha=0.7)
plt.axhline(y=overall_success_rate, color='r', linestyle='--', alpha=0.5, label=f'Overall: {overall_success_rate:.2%}')
plt.title('Invitation Success Rate by Top 20 Receivers')
plt.xlabel('Number of Invitations Received')
plt.ylabel('Success Rate')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Analyze relationship between sender and receiver sentiment.
# Only the columns needed here are merged, and they are renamed up front so
# that sender and receiver fields cannot be confused by suffix handling.
sender_sentiment = user_data[['uid', 'sentiment_score']].rename(
    columns={'uid': 'sid', 'sentiment_score': 'sender_sentiment'})
receiver_sentiment = user_data[['uid', 'sentiment_score']].rename(
    columns={'uid': 'rid', 'sentiment_score': 'receiver_sentiment'})

# Inner merges keep only messages whose sender and receiver both have a profile
merged_data = (
    message_data.rename(columns={'sentiment_score': 'message_sentiment'})
    .merge(sender_sentiment, on='sid')
    .merge(receiver_sentiment, on='rid')
)

# Plot relationship between sender and message sentiment
plt.figure(figsize=(10, 6))
plt.scatter(merged_data['sender_sentiment'], merged_data['message_sentiment'], 
            c=merged_data['success'], cmap='coolwarm', alpha=0.5)
plt.colorbar(label='Success')
plt.title('Relationship Between Sender Sentiment and Message Sentiment')
plt.xlabel('Sender Profile Sentiment')
plt.ylabel('Message Sentiment')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Plot relationship between message sentiment and success
plt.figure(figsize=(10, 6))
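bins = np.linspace(-1, 1, 20)  # 20 edges, i.e. 19 equal-width bins
# include_lowest keeps messages with polarity exactly -1 in the first bin
sentiment_bins = pd.cut(merged_data['message_sentiment'], bins, include_lowest=True)
success_by_sentiment = merged_data.groupby(sentiment_bins, observed=False)['success'].mean()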
success_by_sentiment.plot(kind='bar')
plt.title('Invitation Success Rate by Message Sentiment')
plt.xlabel('Message Sentiment')
plt.ylabel('Success Rate')
plt.axhline(y=overall_success_rate, color='r', linestyle='--', alpha=0.5, label=f'Overall: {overall_success_rate:.2%}')
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
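
A binned success rate is only as reliable as the number of messages in each bin, and extreme sentiment bins are often sparsely populated. The sketch below reports the count next to the mean for each bin. It uses an invented toy frame in place of `merged_data` and only four bins for readability.

```python
import numpy as np
import pandas as pd

# Toy stand-in for merged_data; values are invented for illustration.
toy = pd.DataFrame({
    "message_sentiment": [-1.0, -0.4, 0.1, 0.15, 0.6, 0.9],
    "success":           [0,    0,    1,   0,    1,   1],
})

edges = np.linspace(-1, 1, 5)  # five edges, four equal-width bins
sentiment_bin = pd.cut(toy["message_sentiment"], edges, include_lowest=True)

# Mean outcome and number of messages per sentiment bin.
binned = (
    toy.groupby(sentiment_bin, observed=False)["success"]
    .agg(success_rate="mean", n_messages="count")
)
print(binned)
```

Bins with very small `n_messages` can then be greyed out or merged with their neighbours before reading anything into the bar heights.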

## Graph Structure Analysis

In [None]:
# Create a graph from user and message data
G = nx.DiGraph()  # Directed graph

# Add nodes (users)
for user_id in user_data['uid']:
    G.add_node(user_id)

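# Add edges (messages). A DiGraph keeps a single edge per (sender, receiver)
# pair, so repeated invitations between the same two users overwrite the
# earlier edge and only the last 'success' value is retained.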
for _, row in message_data.iterrows():
    sender_id = row['sid']
    receiver_id = row['rid']
    success = row['success']
    
    # Only add edges if both users exist in the graph
    if sender_id in G.nodes and receiver_id in G.nodes:
        G.add_edge(sender_id, receiver_id, success=success)

# Graph statistics
print(f"Number of nodes: {G.number_of_nodes()}")
print(f"Number of edges: {G.number_of_edges()}")
print(f"Graph density: {nx.density(G):.6f}")

# Compute degree metrics
in_degrees = dict(G.in_degree())
out_degrees = dict(G.out_degree())

print(f"Average in-degree: {np.mean(list(in_degrees.values())):.2f}")
print(f"Average out-degree: {np.mean(list(out_degrees.values())):.2f}")

In [None]:
# Plot degree distributions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

in_degree_values = list(in_degrees.values())
out_degree_values = list(out_degrees.values())

sns.histplot(in_degree_values, bins=30, kde=True, ax=ax1)
ax1.set_title('In-degree Distribution')
ax1.set_xlabel('In-degree')
ax1.set_ylabel('Count')

sns.histplot(out_degree_values, bins=30, kde=True, ax=ax2)
ax2.set_title('Out-degree Distribution')
ax2.set_xlabel('Out-degree')
ax2.set_ylabel('Count')

plt.tight_layout()
plt.show()
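
Linear-scale histograms make it hard to judge how heavy the tail of a degree distribution is. A common complementary view is the empirical complementary CDF, P(degree >= d), drawn on log-log axes. The sketch below computes it with NumPy; `degree_ccdf` is a hypothetical helper and the toy degrees are invented.

```python
import numpy as np

def degree_ccdf(degrees):
    """Return the distinct positive degrees and P(degree >= d) for each."""
    values = np.asarray([d for d in degrees if d > 0])
    unique, counts = np.unique(values, return_counts=True)
    # Reverse cumulative sum: number of nodes with degree >= each unique value.
    at_least = counts[::-1].cumsum()[::-1]
    return unique, at_least / values.size

toy_degrees = [0, 1, 1, 1, 2, 2, 4, 8]
d, ccdf = degree_ccdf(toy_degrees)
print(d, ccdf)
```

With the notebook's data this could be called as `degree_ccdf(in_degrees.values())` and drawn with `plt.loglog(d, ccdf, marker='o')`. A roughly straight line is consistent with a power law but does not establish one.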

In [None]:
# Analyze connection between node importance and success rate
# Compute PageRank
pagerank = nx.pagerank(G)

# Compute success rates for each sender
sender_success_rates = {}
for sender_id in G.nodes():
    outgoing_edges = list(G.out_edges(sender_id, data=True))
    if outgoing_edges:
        success_count = sum(1 for _, _, data in outgoing_edges if data['success'] == 1)
        sender_success_rates[sender_id] = success_count / len(outgoing_edges)
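    else:
        # No invitations sent, so the rate is undefined (NaN) rather than 0;
        # matplotlib skips NaN points in the scatter plot below
        sender_success_rates[sender_id] = np.nan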

# Create a DataFrame for visualization
node_metrics = pd.DataFrame({
    'user_id': list(G.nodes()),
    'in_degree': [in_degrees[node] for node in G.nodes()],
    'out_degree': [out_degrees[node] for node in G.nodes()],
    'pagerank': [pagerank[node] for node in G.nodes()],
    'success_rate': [sender_success_rates[node] for node in G.nodes()]
})

# Plot PageRank vs Success Rate
plt.figure(figsize=(10, 6))
plt.scatter(node_metrics['pagerank'], node_metrics['success_rate'], alpha=0.5)
plt.title('Relationship Between Node PageRank and Invitation Success Rate')
plt.xlabel('PageRank')
plt.ylabel('Success Rate')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Visualize the graph (sample for better visualization)
# Take a subgraph of top nodes based on PageRank for visualization
top_nodes = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:50]
top_node_ids = [node for node, _ in top_nodes]
subgraph = G.subgraph(top_node_ids)

plt.figure(figsize=(14, 10))
pos = nx.spring_layout(subgraph, seed=42)

# Draw nodes with size proportional to PageRank
node_sizes = [pagerank[node] * 10000 for node in subgraph.nodes()]
nx.draw_networkx_nodes(subgraph, pos, node_size=node_sizes, alpha=0.7, node_color='skyblue')

# Draw edges with color based on success (green for success, red for failure)
edges_success = [(u, v) for u, v, d in subgraph.edges(data=True) if d['success'] == 1]
edges_failure = [(u, v) for u, v, d in subgraph.edges(data=True) if d['success'] == 0]

nx.draw_networkx_edges(subgraph, pos, edgelist=edges_success, edge_color='green', alpha=0.5, arrows=True)
nx.draw_networkx_edges(subgraph, pos, edgelist=edges_failure, edge_color='red', alpha=0.5, arrows=True)

# Add node labels for top 10 nodes by PageRank
top_10_nodes = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:10]
top_10_node_ids = [node for node, _ in top_10_nodes]
labels = {node: str(node) for node in top_10_node_ids if node in subgraph.nodes()}
nx.draw_networkx_labels(subgraph, pos, labels=labels, font_size=10)

plt.title('Network of User Invitations (Top 50 Users by PageRank)')
plt.axis('off')
plt.tight_layout()
plt.show()

## Summary of Findings

This exploratory analysis has revealed several patterns in the data that can inform our GNN model design:

1. **User Profiles:**
   - User profiles generally have positive sentiment, reflecting enthusiasm for travel experiences.
   - Common themes include exploring new cultures, trying local cuisine, and making memories.

2. **Messages:**
   - Message sentiment appears to have a relationship with invitation success.
   - Successful messages tend to use different language compared to unsuccessful ones.
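   - Message length may be a factor in invitation success, but the analysis above only summarizes its distribution; it has not yet been compared across outcomes.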

3. **User-Message Relationships:**
   - Some users have higher success rates than others when sending invitations.
   - The scatter plot suggests a correlation between sender profile sentiment and message sentiment, although no correlation coefficient has been computed.
   - Certain users are more likely to accept invitations than others.

4. **Graph Structure:**
   - The user network shows characteristics typical of social networks, such as a heavy-tailed degree distribution; whether it follows a power law has not been formally tested here.
   - Node centrality metrics like PageRank may be useful features for predicting invitation success.
   - The graph is relatively sparse, suggesting that most users interact with only a small subset of other users.

These insights will guide our feature engineering and GNN architecture design for predicting invitation success.
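
As a starting point for that feature engineering, per-node metrics can be joined onto each message so that every invitation carries sender-side and receiver-side features. The sketch below assumes a metrics frame shaped like `node_metrics` above (one row per user with a `user_id` column); `edge_feature_table` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd

def edge_feature_table(messages, node_metrics, id_col="user_id"):
    """Attach sender-side and receiver-side node metrics to each message."""
    metrics = node_metrics.set_index(id_col)
    return (
        messages
        .join(metrics.add_prefix("sender_"), on="sid")
        .join(metrics.add_prefix("receiver_"), on="rid")
    )

# Toy inputs; the metric values are invented for illustration.
toy_messages = pd.DataFrame({"sid": [1, 2], "rid": [2, 3], "success": [1, 0]})
toy_metrics = pd.DataFrame({
    "user_id":   [1, 2, 3],
    "pagerank":  [0.2, 0.5, 0.3],
    "in_degree": [0, 1, 1],
})

features = edge_feature_table(toy_messages, toy_metrics)
print(features)
```

One caution when applying this to the notebook's `node_metrics`: its `success_rate` column is computed from the same labels the model is meant to predict, so it should be dropped or recomputed on a training split only to avoid label leakage.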