## **Project: Stock Market Analysis Using Twitter Data**

### **🔍 Objective**

The purpose of this project is to analyze Twitter conversations related to the stock market—particularly around companies like **Tesla, Apple, Amazon, Microsoft, and the SPY ETF**—to understand how public sentiment aligns with stock market behavior. By performing sentiment analysis on relevant tweets, this project aims to uncover patterns in public opinion that may reflect investor mood and market trends.

### **📥 Data Collection**
Tweets were retrieved using the **Twitter API** by querying the hashtag `#StockMarket`. The collected tweets include discussions about the overall market and specific companies, as well as common trading sentiments. The dataset captures both the content and engagement of each tweet, such as retweets and likes.

### **Importing Libraries**

In [None]:
import tweepy
import pandas as pd

### **Connecting to Twitter API Using Bearer Token**


In [None]:
# Initializing Tweepy Client for Twitter API Access
import os

# Read the bearer token from the environment instead of hard-coding it in the notebook.
# The variable name TWITTER_BEARER_TOKEN is an assumption; use whatever name you export.
bearer_token = os.environ["TWITTER_BEARER_TOKEN"]
client = tweepy.Client(bearer_token=bearer_token, wait_on_rate_limit=True)

### **Fetching Tweets From Twitter API**

In [None]:
# Define the function to fetch tweets
def fetch_tweets(client, query, max_results=60):
    tweets = []

    # Fetch recent tweets based on the query
    response = client.search_recent_tweets(
        query=query + " -is:retweet lang:en",  # Correct way to add language filter
        tweet_fields=['created_at', 'public_metrics', 'author_id', 'text'],
        max_results=max_results
    )

    # Check if the response has data
    if response.data:
        for tweet in response.data:
            tweets.append({
                'created_at': tweet.created_at,
                'text': tweet.text,
                'retweets': tweet.public_metrics['retweet_count'],
                'likes': tweet.public_metrics['like_count']
            })

    # Return a DataFrame with the tweets
    return pd.DataFrame(tweets)

# Define the query for stock-related tweets
query = "#StockMarket"  # Using hashtag for stock-related tweets

# Fetch tweets and store in a DataFrame
tweets_df = fetch_tweets(client, query, max_results=60)

# Check if the DataFrame is empty and print the results
if not tweets_df.empty:
    print(tweets_df)
else:
    print("No tweets were found for the given query.")
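
The query string sent to the API combines the hashtag with the filter operators used above (`-is:retweet` and `lang:en`). A minimal sketch of building that string separately, so the filters can be switched on or off per collection run; the helper name `build_query` and its parameters are hypothetical and not part of Tweepy:

```python
def build_query(hashtag, exclude_retweets=True, lang="en"):
    """Assemble a recent-search query string from a hashtag and optional filters."""
    parts = [hashtag]
    if exclude_retweets:
        parts.append("-is:retweet")   # skip retweets so each tweet text appears once
    if lang:
        parts.append(f"lang:{lang}")  # restrict results to one language
    return " ".join(parts)

query = build_query("#StockMarket")
print(query)
```

The resulting string can be passed as the `query` argument of `client.search_recent_tweets`, in place of the concatenation inside `fetch_tweets`.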

### 💾 **Saving Retrieved Tweets to CSV Files**

In [None]:
# Exporting Retrieved Tweets to CSV Files
tweets_df.to_csv('stock_tweets.csv')
tweets_df2.to_csv('stockmarket_new_tweets.csv')
tweets_df3.to_csv('stockmarket_60new_tweets.csv')
tweets_df4.to_csv('stockmarket_tweets.csv')

The collected tweets were saved into multiple CSV files after being retrieved at different times.

### **Merging Tweet Datasets**

In [None]:
tweets_df = pd.read_csv('/content/stock_tweets.csv')
tweets_df2 = pd.read_csv('/content/stockmarket_new_tweets.csv')
tweets_df3 = pd.read_csv('/content/stockmarket_60new_tweets.csv')
tweets_df4 = pd.read_csv('/content/stockmarket_tweets.csv')

# Combine them into a single DataFrame
combined_tweets_df = pd.concat([tweets_df, tweets_df2, tweets_df3, tweets_df4], ignore_index=True)

The multiple tweet datasets were merged into one unified DataFrame for easier analysis and processing.

### **Stock Market Tweets Dataset After Merging**

In [None]:
print(combined_tweets_df.head())

This dataset includes all the tweets related to stock market discussions about companies like Tesla, Apple, Amazon, Microsoft, and the SPY ETF. The data is combined into one file, with tweet text and engagement metrics (such as retweets and likes), and is ready for cleaning and further analysis.

**📊 Dataset Columns**

**created_at:** Date and time the tweet was posted

**text:** Original tweet text

**retweets:** Number of times the tweet was retweeted

**likes:** Number of likes

In [None]:
print(combined_tweets_df.shape)

The dataset contains 180 rows (tweets) and 5 columns: `created_at`, `text`, the engagement metrics `retweets` and `likes`, and an extra index column (`Unnamed: 0`) that `to_csv` wrote by default and `read_csv` read back.



### **Remove Duplicate Tweets**

In [None]:
# Remove duplicates (Optional)
combined_tweets_df.drop_duplicates(inplace=True)

In [None]:
# Save to new CSV (optional)
combined_tweets_df.to_csv("all_stock_tweets_API.csv", index=False)

print("Combined text files saved to 'all_stock_tweets_API.csv'")

# Check result
print(combined_tweets_df.shape)
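
Called without arguments, `drop_duplicates()` compares every column. The same tweet collected in two runs with different like counts (or a different saved index value) therefore counts as two distinct rows. The sketch below shows text-based de-duplication on plain dictionaries; the rows are made up for illustration and `dedupe_by_text` is a hypothetical helper. In pandas, the comparable call would be `drop_duplicates(subset=["text"])`.

```python
def dedupe_by_text(records):
    """Keep the first record for each distinct tweet text (case and whitespace normalized)."""
    seen = set()
    unique = []
    for record in records:
        key = " ".join(record["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Hypothetical rows: the same tweet captured in two collection runs with different like counts
rows = [
    {"text": "Markets rally on earnings #StockMarket", "likes": 3},
    {"text": "Markets rally on earnings #StockMarket", "likes": 5},
    {"text": "Tariff news hits tech stocks", "likes": 1},
]
unique_rows = dedupe_by_text(rows)
print(len(unique_rows))
```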

### **Most Frequent Words Before Cleaning**

In [None]:
from collections import Counter
word_freq = Counter(" ".join(combined_tweets_df["text"]).split())
print("Most common words:", word_freq.most_common(20))

### **Text Cleaning**

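**Common Text-Cleaning Steps:**
- **http\S+**: Removes URLs
- **@\S+**: Removes mentions
- **#\S+**: Removes hashtags (not applied in this project, which keeps hashtags and `$` tickers)
- **[^A-Za-z0-9 ]+**: Removes special characters (this project removes punctuation but keeps `#` and `$`)
- **.lower()**: Converts text to lowercase
- **Removing Stopwords**: "the", "is", "and"
- **Lemmatization**: "stocks" → "stock" (with the default noun setting, verb forms such as "running" are left unchanged)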

In [None]:
# Download necessary NLTK resources
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

In stock-related tweets, symbols like `#` and `$` are used to represent companies and market trends, such as $AMZN for Amazon's stock. These symbols are important for understanding the topic of the tweet and for analyzing market discussions. Removing them would strip away key information about the stock references, which are crucial for our analysis.
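
A small illustration of the point above, using a made-up tweet: stripping every special character also strips the `$` and `#` markers, while removing punctuation selectively keeps them attached to their words.

```python
import re
import string

tweet = "Buying $AMZN today! #stocks"  # made-up example tweet

# Stripping every special character also strips the ticker and hashtag markers
stripped_all = re.sub(r"[^A-Za-z0-9 ]+", "", tweet)

# Removing punctuation except '#' and '$' keeps the markers attached to their words
keep = "#$"
table = str.maketrans("", "", "".join(ch for ch in string.punctuation if ch not in keep))
preserved = tweet.translate(table)

print(stripped_all)  # Buying AMZN today stocks
print(preserved)     # Buying $AMZN today #stocks
```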

### **Text Cleaning Without Removing '$' and '#'**

In [None]:
# Without Removing '$' and '#'
import re
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Function to clean text
def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove URLs and mentions, but preserve hashtags (#) and dollar signs ($)
    text = re.sub(r'http\S+|www\.\S+|@\S+', '', text)  # Remove URLs and mentions, keep hashtags and $tickers

    # Remove HTML tags (do not remove hashtags and $tickers)
    text = re.sub(r'<.*?>', '', text)

    # Remove punctuation (but keep hashtags and $tickers intact)
    text = text.translate(str.maketrans('', '', string.punctuation.replace('#', '').replace('$', '')))

    # Remove numbers and any word containing a digit (this also strips the text of tickers or hashtags that contain digits)
    text = re.sub(r'\w*\d\w*', '', text)

    # Remove extra whitespaces and newlines
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenization and stopword removal
    stop_words = set(stopwords.words('english'))
    words = text.split()

    # Preserve words starting with # or $ (for stock tickers and hashtags)
    words = [word for word in words if word not in stop_words or word.startswith(('#', '$'))]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]

    return ' '.join(words)

# Example Usage
# Applying the clean_text function to each tweet in the 'text' column
combined_tweets_df["clean_text"] = combined_tweets_df["text"].apply(clean_text)

# Print the cleaned text of the first 20 rows
print(combined_tweets_df[["text", "clean_text"]].head(20))

### **Cleaned Text File**

In [None]:
# Save the cleaned text to a CSV file
combined_tweets_df[["clean_text"]].to_csv("stockmarket_cleanedtext.csv", index=False)

print("Cleaned text saved to 'stockmarket_cleanedtext.csv'")

### **Most Frequent Words After Cleaning**

In [None]:
from collections import Counter
word_freq = Counter(" ".join(combined_tweets_df["clean_text"]).split())
most_common_words = word_freq.most_common(20)
print("Most common words:", most_common_words)

### **Visualization using Most Frequent Words (Bar Chart)**


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Convert to DataFrame for plotting
freq_df = pd.DataFrame(most_common_words, columns=['word', 'count'])

# Plotting
plt.figure(figsize=(12, 6))
sns.barplot(x='count', y='word', data=freq_df, palette='viridis')
plt.title("Top 20 Most Frequent Words in Tweets")
plt.xlabel("Frequency")
plt.ylabel("Word")
plt.tight_layout()
plt.show()

The most frequently occurring words in the stock market-related tweets include hashtags like **`#stockmarket`, `#stocks`, and `#market`**, which help identify the key topics being discussed. Other common terms refer to trending issues such as **`#stockmarketcrash`, `#trump`, `#investing`, and `#trading`**, as well as mentions of assets like **`#bitcoin`, `#crypto`, and `#gold`**. These frequently used words highlight the main themes and discussions within the stock market conversations captured in the tweets.

### **Word Cloud**

In [None]:
!pip install wordcloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Combine all the clean tweets into one big string
text = " ".join(combined_tweets_df["clean_text"])

# Create the WordCloud object
wordcloud = WordCloud(
    width=800,
    height=400,
    background_color='white',
    colormap='viridis',     # Optional: choose any matplotlib colormap
    max_words=100           # Adjust for more/less words
).generate(text)

# Plot it
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Most Frequent Words in Tweets", fontsize=16)
plt.tight_layout()
plt.show()

This word cloud represents the most frequently used words in tweets related to the stock market. Keywords like **“stockmarket”**, **“tariff”**, **“trump”**, **“trading”**, and **“investing”** highlight the trending topics. The visualization provides a quick overview of what people were discussing when the tweets were collected, which gives context for the sentiment analysis that follows.

### **Sentiment Analysis**



### **1. TextBlob Sentiment Analysis**

In [None]:
!pip install textblob

In [None]:
from textblob import TextBlob

combined_tweets_df["Sentiment"] = combined_tweets_df["clean_text"].apply(lambda x: TextBlob(x).sentiment.polarity)
print(combined_tweets_df["Sentiment"].describe())

**Analysis Summary:**
- Total Tweets Analyzed: 180

- Average Sentiment (Mean): ~0.023 → Slightly positive overall.

- Standard Deviation: ~0.231 → There's a fair bit of variation in sentiment.

- Minimum Sentiment Score: -1.0 → Strongly negative tweet(s) present.

- Maximum Sentiment Score: 1.0 → Strongly positive tweet(s) present.

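- Median (50%): 0.0 → At least half of the tweets score 0 or below.

- 25th Percentile: 0.0 → Together with the median, this means at least a quarter of the tweets score exactly 0 (neutral).

- 75th Percentile: ~0.07 → The middle half of the tweets falls between neutral and slightly positive.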
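
To make the percentile reasoning concrete, here is a minimal sketch that computes the same kind of summary from a plain list of polarity scores. The scores are made up for illustration and are not the project's data. The `inclusive` method is chosen as an assumption to mirror the linear interpolation that `describe()` uses by default.

```python
import statistics

# Made-up polarity scores for illustration only (not the project's data)
scores = [-1.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.5, 1.0]

# Quartiles by linear interpolation between neighbouring sorted values
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")

# Share of tweets whose polarity is exactly zero
zero_share = sum(1 for s in scores if s == 0.0) / len(scores)

print(f"mean={statistics.mean(scores):.3f}")
print(f"25%={q1:.2f} 50%={median:.2f} 75%={q3:.2f}")
print(f"share of exactly-neutral scores={zero_share:.0%}")
```

When the 25th percentile and the median are both 0.0, every score between those two positions is 0, which is why a large share of exactly-neutral tweets shows up as a spike at 0 in a histogram.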

In [None]:
# Histogram of Sentiment Scores
combined_tweets_df["Sentiment"].hist(bins=30, alpha=0.5, label='Tweepy')
plt.legend()
plt.title("Histogram of Sentiment Scores")
plt.xlabel("Sentiment Score")
plt.ylabel("Frequency")
plt.show()

**Interpretation:**

The sentiment analysis using **TextBlob** shows that most tweets have a neutral tone, with sentiment scores clustered heavily around 0, as seen in the histogram. While there are a few strongly positive or negative tweets (scores near 1.0 or -1.0), the overall distribution suggests minimal emotional intensity in the majority of tweets.

However, this outcome may be influenced by the limitations of **TextBlob**:

- It can miss sentiment in social media text that includes slang, emojis, hashtags, and other informal elements.

- It may misinterpret or ignore context, especially with sarcasm, abbreviations, or short expressions commonly found on Twitter.

Because of these limitations, the analysis might not fully capture the true emotional tone of tweets. To address this, the next step uses **VADER** Sentiment Analysis, which is specifically designed for short, informal, and social media-style text.


### **2. VADER Sentiment Analysis**

In [None]:
!pip install vaderSentiment

Here I'm using **VADER Sentiment Analysis (Valence Aware Dictionary and sEntiment Reasoner)** to classify the sentiment of tweets in the dataset.

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Initialize VADER
analyzer = SentimentIntensityAnalyzer()

# Define a function to classify sentiment
def get_sentiment(text):
    score = analyzer.polarity_scores(text)['compound']
    if score >= 0.05:
        return 'positive'
    elif score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Apply to your DataFrame
combined_tweets_df['sentiment'] = combined_tweets_df['clean_text'].apply(get_sentiment)


In [None]:
combined_tweets_df.head(5)

### **Bar chart of Sentiment counts**

In [None]:
# Plotting a bar chart showing the number of tweets in each sentiment category
combined_tweets_df['sentiment'].value_counts().plot(kind='bar')

**VADER Sentiment Analysis (Bar Chart)**

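- This bar chart shows tweet sentiment using **VADER**, which is designed for social media text. Positive tweets form the largest group, with neutral and negative tweets close behind.

- **VADER** is built to handle slang, emojis, and informal language, so it is expected to give a more representative picture of public opinion on the stock market than **TextBlob**. This was not validated against labeled data here, and because VADER is applied to `clean_text` (lowercased, punctuation removed), cues such as capitalization and exclamation marks are not available to it.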
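
The three categories in the bar chart come from cut-offs on VADER's compound score, which ranges from -1 to 1. The sketch below isolates that thresholding step so the boundary cases are easy to check without running the analyzer. The function name `label_from_compound` is hypothetical, and the input scores are made up.

```python
def label_from_compound(score, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores, chosen to sit on and around the cut-offs
for s in (0.6, 0.05, 0.049, 0.0, -0.05, -0.4):
    print(s, label_from_compound(s))
```

Scores exactly at ±0.05 count as positive or negative, so only the open interval between the two cut-offs is labeled neutral.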



### **Separating Tweets by Sentiment (Neutral, Positive, and Negative Tweets)**

In [None]:
# Separate tweets by sentiment
positive_tweets = combined_tweets_df[combined_tweets_df['sentiment'] == 'positive']
neutral_tweets = combined_tweets_df[combined_tweets_df['sentiment'] == 'neutral']
negative_tweets = combined_tweets_df[combined_tweets_df['sentiment'] == 'negative']


Here, I wanted to visualize the frequent words in each **sentiment group (positive_tweets, neutral_tweets, negative_tweets)** using one **WordCloud** per group.

In [None]:
# Combine all clean_text for each sentiment
positive_words = ' '.join(positive_tweets['clean_text'])
neutral_words = ' '.join(neutral_tweets['clean_text'])
negative_words = ' '.join(negative_tweets['clean_text'])

# Function to create wordcloud
def show_wordcloud(text, title):
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=16)
    plt.show()

# Generate wordclouds
show_wordcloud(positive_words, 'Frequent Words in Positive Tweets')
show_wordcloud(neutral_words, 'Frequent Words in Neutral Tweets')
show_wordcloud(negative_words, 'Frequent Words in Negative Tweets')


### **Emotion Analysis Using Lexicon-Based Approach**

In [None]:
!pip install nltk

In [None]:
import nltk
import pandas as pd
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download necessary resources from NLTK
nltk.download('opinion_lexicon')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

In [None]:
# Define a Function for Emotion Analysis

emotion_lexicon = {
    'joy': [
        'gain', 'profit', 'bullish', 'win', 'green', 'soaring', 'surge', 'rally', 'strong',
        'record', 'positive', 'optimistic', 'uptrend', 'stockstobuy', 'invest', 'buythedip',
        '#gain', '#profit', '#bullish', '#win', '#green', '#soaring', '#surge', '#rally', '#strong',
        '#record', '#positive', '#optimistic', '#uptrend', '#stockstobuy', '#invest', '#buythedip'
    ],
    'anger': [
        'fraud', 'scam', 'manipulated', 'rigged', 'corrupt', 'unfair', 'angry', 'mad', 'rage', 'annoyed',
        '#fraud', '#scam', '#manipulated', '#rigged', '#corrupt', '#unfair', '#angry', '#mad', '#rage', '#annoyed'
    ],
    'sadness': [
        'loss', 'lost', 'drop', 'down', 'red', 'slowdown', 'crash', 'stockmarketcrash', 'fall', 'decline',
        'disappointed', 'depressed', 'regret', 'downtrend', 'selloff', 'worst', 'hit',
        '#loss', '#lost', '#drop', '#down', '#red', '#slowdown', '#crash', '#stockmarketcrash',
        '#fall', '#decline', '#disappointed', '#depressed', '#regret', '#downtrend', '#selloff'
    ],
    'fear': [
        'fear', 'scared', 'panic', 'nervous', 'uncertain', 'volatility', 'risky', 'warning',
        'concern', 'bearish', 'doubt', 'recession', 'losing',
        '#fear', '#scared', '#panic', '#nervous', '#uncertain', '#volatility', '#risky',
        '#warning', '#concern', '#bearish', '#doubt', '#recession', '#losing'
    ],
    'surprise': [
        'unexpected', 'shock', 'surprised', 'amazement', 'jump', 'plummet', 'news', 'announcement', 'earningssurprise',
        '#unexpected', '#shock', '#surprised', '#amazement', '#jump', '#plummet', '#news', '#announcement', '#earningssurprise'
    ],
    'trust': [
        'confidence', 'believe', 'support', 'faith', 'trust', 'solid', 'stable', 'resilient',
        '#confidence', '#believe', '#support', '#faith', '#trust', '#solid', '#stable', '#resilient'
    ]
}


# Define the emotion classification function
def classify_emotion(text):
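    # Tokenize and clean the text
    # word_tokenize splits '#crash' into '#' and 'crash', and the isalnum() filter then drops the '#',
    # so hashtags are matched through the plain-word lexicon entries
    words = word_tokenize(text.lower())  # Convert to lowercase
    words = [word for word in words if word.isalnum()]  # Keep alphanumeric tokens only (drops punctuation, '#', '$')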
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]  # Remove stopwords

    # Initialize emotion scores
    emotion_scores = {emotion: 0 for emotion in emotion_lexicon}

    # Count matching words for each emotion
    for word in words:
        for emotion, emotion_words in emotion_lexicon.items():
            if word in emotion_words:
                emotion_scores[emotion] += 1

    # Determine the dominant emotion (or 'none' if all are 0); ties go to the category listed first in emotion_lexicon
    if all(score == 0 for score in emotion_scores.values()):
        return 'none', 0
    else:
        max_emotion = max(emotion_scores, key=emotion_scores.get)
        return max_emotion, emotion_scores[max_emotion]

# Apply the function to the DataFrame
combined_tweets_df[['emotion', 'emotion_score']] = combined_tweets_df['clean_text'].apply(
    lambda text: pd.Series(classify_emotion(text))
)

# Display results
print(combined_tweets_df[['clean_text', 'emotion', 'emotion_score']].head(50))


To understand the emotional tone behind stock-related tweets, a custom emotion lexicon was created with categories like **joy, anger, sadness, fear, surprise, and trust**. Each tweet is analyzed by matching keywords to these categories using tokenization and filtering techniques.

The function then classifies each tweet by its dominant emotion, assigning a corresponding score based on keyword matches. This helps uncover not just sentiment, but the underlying emotions driving public opinion around the stock market.
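
The matching logic can be sketched on a much smaller scale. The example below uses a three-category lexicon whose keywords are a subset of the project's lexicon, and a plain whitespace split in place of NLTK tokenization (an assumption made so the sketch runs without downloads). The sample sentences are made up, and the name `dominant_emotion` is hypothetical.

```python
# Tiny illustrative lexicon; the keywords are a subset of the project's lexicon
mini_lexicon = {
    "joy": {"gain", "rally", "bullish"},
    "sadness": {"loss", "crash", "decline"},
    "fear": {"panic", "recession"},
}

def dominant_emotion(text, lexicon=mini_lexicon):
    """Return (emotion, count) for the category with the most keyword hits."""
    # Plain whitespace split with leading '#'/'$' stripped; stands in for NLTK tokenization
    words = [w.lstrip("#$") for w in text.lower().split()]
    scores = {emotion: sum(w in keywords for w in words) for emotion, keywords in lexicon.items()}
    best = max(scores, key=scores.get)  # on ties, the first category in the dict wins
    return (best, scores[best]) if scores[best] > 0 else ("none", 0)

print(dominant_emotion("#crash fears grow as recession panic spreads"))
print(dominant_emotion("rally fades into loss"))
print(dominant_emotion("fed meeting scheduled for wednesday"))
```

The second sentence has one hit each for joy and sadness. It is labeled joy only because that category is listed first, which is worth keeping in mind when reading the emotion counts.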

### **Visualizing the Tweet Counts by Emotion (Joy, Anger, Sadness, Fear, Surprise, Trust)**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

emotion_counts = combined_tweets_df['emotion'].value_counts()
sns.barplot(x=emotion_counts.index, y=emotion_counts.values, palette='viridis')
plt.title('Distribution of Emotions in Stock Market Tweets')
plt.xlabel('Emotion')
plt.ylabel('Number of Tweets')
plt.xticks(rotation=45)
plt.show()


**Interpretation: Distribution of Emotions in Stock Market Tweets**

The majority of tweets are labeled with **"none"**, meaning they contained no keywords from the defined lexicon. This is plausible for financial tweets, which are often factual or analytical, though it also reflects the limited size of the lexicon.

Among the emotionally tagged tweets:

- **Sadness** is the most common emotion, indicating concern about losses, crashes, or downtrends.

- **Joy** follows, reflecting optimism about gains, rallies, or bullish trends.

Emotions like **surprise, fear, anger, and trust** are less frequent but still present, showing a diverse emotional reaction to stock market events.

Overall, this suggests that while most tweets carry no lexicon-detected emotion, the emotional tweets lean toward **sadness and joy**, in line with reactions to market losses and gains.

### **Visualizing Popular Stock Names (Cashtags and Hashtags)**

### **Extracting Stock Mentions from Tweets**



In [None]:
import re
from collections import Counter

# Combine all tweets into one big string
all_text = ' '.join(combined_tweets_df['clean_text'].tolist())

# Use regex to extract all words starting with $ or #
stock_mentions = re.findall(r'[\$#][A-Za-z0-9]+', all_text)

# Count the frequency of each mention
mention_counts = Counter(stock_mentions)

# Convert to a DataFrame for plotting
mention_df = pd.DataFrame(mention_counts.items(), columns=['symbol', 'count'])

# Top 100
mention_df = mention_df.sort_values(by='count', ascending=False).head(100)


Using regular expressions, tweets were scanned to identify stock-related mentions marked with **`$` (e.g., $AAPL) or `#` (e.g., #Tesla)**. The top 100 most frequently mentioned tickers and hashtags were extracted and counted, providing insights into which stocks or market topics are getting the most attention on Twitter.
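
A minimal, self-contained sketch of the same extraction and counting step on a few made-up cleaned tweets (lowercased, as `clean_text` would leave them):

```python
import re
from collections import Counter

# Hypothetical cleaned tweets (lowercased, as clean_text would leave them)
sample_tweets = [
    "$aapl earnings beat #stocks",
    "#stocks slide while $tsla jump",
    "watching $aapl and #bitcoin",
]

# A '$' or '#' followed by one or more letters or digits
mentions = re.findall(r"[$#][A-Za-z0-9]+", " ".join(sample_tweets))
counts = Counter(mentions)

print(counts.most_common(3))
```

Because cashtags and hashtags are counted under their own prefixes, `$aapl` and `#aapl` are tallied as separate symbols.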

### **Top Mentioned Stock Names (Bar Chart)**


In [None]:
# Horizontal Bar Chart (counts on the x-axis, symbols on the y-axis)
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 20))
sns.barplot(data=mention_df, x='count', y= 'symbol', palette='viridis')
plt.title('Top Mentioned Stock Symbols on Twitter')
plt.xlabel('Frequency')
plt.ylabel('Stock Symbol')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

This horizontal bar chart displays the **most frequently mentioned stock-related hashtags and symbols** from Twitter data.

- **#stockmarket, #tariffs, and #stocks** top the list, indicating heavy discussion around broad market movements and economic policies.

- Other high-frequency mentions include **#stockmarketcrash, #investing,** and asset-related hashtags like **#bitcoin, #tesla, #gold, #eth, #nasdaq and #aapl**.

This visualization helps identify trending topics and the most discussed companies or assets, providing valuable context for sentiment and market analysis.

In [None]:
# Vertical Bar Chart (symbols on the x-axis, counts on the y-axis)
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,6))
sns.barplot(data=mention_df, x='symbol', y='count', palette='viridis')
plt.title('Top Mentioned Stock Symbols on Twitter')
plt.xlabel('Stock Symbol')
plt.ylabel('Frequency')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

### **Word Cloud of Stock Mentions**

In [None]:
from wordcloud import WordCloud

wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(mention_counts)

plt.figure(figsize=(12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Popular Stock Mentions (# or $ symbols)')
plt.show()


The presence of both asset-specific symbols **(like btc, $tsla, $aapl)** and broad market terms **(like #stockmarketcrash, #crypto, #daytrading)** shows a mix of general market sentiment and focus on individual assets.

This visualization gives a quick snapshot of where public attention is focused in the financial world on social media.



### **Synthetic Tweets Using GPT-2 Model**

In [None]:
import torch  # Required for PyTorch backend
from transformers import pipeline # For text generation model

# Set device (0 for the first GPU, -1 for CPU); fall back to CPU when no GPU is available
device = 0 if torch.cuda.is_available() else -1

# Load the GPT-2 model (other candidate models are left commented out)
generator = pipeline("text-generation", model = "gpt2", device = device)
# generator = pipeline("text-generation", model="distilgpt2", device=device)
# generator = pipeline("text-generation", model="MetaIX/GPT4-X-Alpasta-30b", device=device)


def generate_tweets(prompt, num_tweets = 5):
    # Sampling must be enabled to get multiple distinct sequences from one prompt
    return generator(prompt, max_length=50, num_return_sequences = num_tweets, do_sample=True)

prompt = "Generate a tweet about stocks and SPY ETF"

synthetic_tweets = generate_tweets(prompt, 1000)

# Print Generated Tweets
for tweet in synthetic_tweets:
    print(tweet["generated_text"])
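
By default, the text-generation pipeline returns the prompt followed by the continuation, so every synthetic tweet starts with the same prompt sentence. A minimal sketch of stripping that prefix before saving; the helper name `strip_prompt` is hypothetical and the example output is made up to match the shape of one returned item.

```python
def strip_prompt(generated_text, prompt):
    """Return only the continuation if the text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()

prompt = "Generate a tweet about stocks and SPY ETF"
# Hypothetical pipeline output, shaped like one item of the returned list
example = {"generated_text": prompt + " as markets open higher today"}

print(strip_prompt(example["generated_text"], prompt))
```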

### **Save to a .csv file using pandas**

In [None]:
import pandas as pd

# Convert to DataFrame
tweet_texts = [tweet["generated_text"] for tweet in synthetic_tweets]
df = pd.DataFrame(tweet_texts, columns=["tweet"])

# Save to CSV
df.to_csv("synthetic_tweets.csv", index=False, encoding="utf-8")


In [None]:
import pandas as pd
import re

# Step 1: Load the CSV file
synthetic_tweets_df = pd.read_csv("synthetic_tweets.csv")  # make sure the file is uploaded
synthetic_tweets_df.head(20)

In [None]:
import re

# Step 2: Define the cleaning function
# Note: this redefines clean_text from the earlier cell with a simpler version
# (no stopword removal or lemmatization; '#' and '$' symbols are dropped but the words are kept)
def clean_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs
    text = re.sub(r'@\w+', "", text)  # Remove mentions
    text = re.sub(r'#', "", text)  # Remove hashtag symbols (keep the word)
    text = re.sub(r'[^a-z\s]', "", text)  # Remove punctuation, numbers, and other symbols
    text = re.sub(r'\s+', " ", text).strip()  # Remove extra spaces
    return text

# Step 3: Apply cleaning to tweet column
synthetic_tweets_df['cleaned_tweet'] = synthetic_tweets_df['tweet'].astype(str).apply(clean_text)

# Step 4: Save or view cleaned tweets
synthetic_tweets_df.to_csv("cleaned_synthetic_tweets.csv", index=False)
synthetic_tweets_df[['tweet', 'cleaned_tweet']].head()
