In this project, we are predicting trends in technology adoption and interest based on social media (Twitter) data. Specifically, the model aims to forecast the following:

1. **Volume of Discussions**: Predicting the number of tweets or social media posts about specific technologies, gadgets, or software over a future time window (e.g., per day or per week). This serves as an indicator of public interest and awareness.

2. **Sentiment Trends**: Forecasting the overall sentiment (positive, negative, neutral) associated with these technologies in the social media discourse. This could involve predicting the average sentiment score or the proportion of tweets falling into each sentiment category for upcoming days.

3. **Combination of Volume and Sentiment**: A more comprehensive approach might involve predicting both the volume of discussion and the sentiment concurrently. This dual prediction can provide a more nuanced understanding of how public interest and perception might evolve over time.
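
To make these three targets concrete, here is a minimal sketch of how they could be computed per day once each tweet has a sentiment label and score. The record layout `(day, label, score)`, the function name `summarize_daily`, and the sample values are assumptions made for illustration; they are not produced by any code in this notebook.

```python
from collections import Counter, defaultdict
from datetime import date

def summarize_daily(records):
    """Summarize (day, label, score) records into per-day forecasting targets.

    Returns {day: {"volume": int, "mean_score": float, "proportions": {label: float}}}.
    """
    by_day = defaultdict(list)
    for day, label, score in records:
        by_day[day].append((label, score))

    summary = {}
    for day, items in sorted(by_day.items()):
        labels = Counter(label for label, _ in items)
        volume = len(items)
        summary[day] = {
            "volume": volume,                                           # target 1
            "mean_score": sum(score for _, score in items) / volume,    # target 2 (average)
            "proportions": {label: n / volume for label, n in labels.items()},  # target 2 (shares)
        }
    return summary

# Hypothetical, hand-made records: (day, sentiment label, sentiment score in [-1, 1])
records = [
    (date(2024, 1, 1), "positive", 0.5),
    (date(2024, 1, 1), "negative", -0.5),
    (date(2024, 1, 1), "positive", 1.0),
    (date(2024, 1, 1), "neutral", 0.0),
    (date(2024, 1, 2), "neutral", 0.0),
]
daily = summarize_daily(records)
```

Forecasting volume and sentiment together (target 3) then amounts to modelling the `volume` and `mean_score` series jointly rather than one at a time.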

### Example Predictions
- **Before a Product Launch**: If there's an upcoming release of a new gadget, the model might predict an increase in discussion volume, and potentially a shift in sentiment, in the period leading up to and following the launch.
- **Emerging Technology Trends**: For emerging tech like augmented reality, blockchain, or new software platforms, the model could forecast how discussions (both in volume and sentiment) about these technologies will trend in the short-term future.

### Purpose of These Predictions
- **Market Insight**: These predictions can provide valuable insights for businesses, marketers, and technologists about consumer interest and sentiment trends, aiding in strategic planning and decision-making.
- **Product Strategy**: For tech companies, understanding how public interest and sentiment are likely to shift can inform product development, marketing strategies, and customer engagement plans.
- **Investment Decisions**: Investors in technology sectors might use these predictions to gauge potential market reactions to new technologies or products.

The predictions, therefore, are not just about the raw data but also about interpreting the data to extract meaningful trends and insights that can inform various strategic decisions in the technology domain.

In [11]:
# For web scraping
import requests
import selenium
from selenium import webdriver
import tweepy

# Data Storage
import sqlite3

# For data manipulation
import pandas as pd
import numpy as np

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# For advanced data manipulation
from scipy import stats

# For machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# For working with APIs
import json

# For datetime operations
from datetime import datetime

# Additional utilities
import os
import re
import time
import zipfile

# For time series analysis
from statsmodels.tsa.arima.model import ARIMA  # statsmodels.tsa.arima_model was removed in statsmodels 0.13
from statsmodels.tsa.statespace.sarimax import SARIMAX

# For real-time data streaming
import websocket
import socket

# For asynchronous programming (useful for real-time data processing)
import asyncio

# Step 2: Data Collection

## 2.1 Twitter API Access
- **Apply for Access**: If you haven't already, apply for a Twitter Developer account at developer.twitter.com. You'll need to explain your project's purpose and how you'll use the data.
- **Create an Application**: Once approved, create a new application in the Twitter Developer portal to get your API keys and tokens, which are required for accessing the Twitter API.

## 2.2 Understand Twitter API Limitations
- **Rate Limits**: Familiarize yourself with the Twitter API rate limits to avoid hitting the cap on the number of requests.
- **Data Availability**: The standard search API only returns tweets from the last 7 days; more historical data requires the premium or enterprise tiers.

## 2.3 Develop Data Collection Script
- **Install Tweepy**: Use the Python library tweepy for interacting with the Twitter API. Install it via pip (`pip install tweepy`).
- **Authentication**: Use your API keys and tokens to authenticate your requests.
- **Querying Tweets**: Write a function to query tweets based on the product or brand name entered by the user. Use query parameters to filter and retrieve only relevant tweets.
- **Handling Rate Limiting**: Implement logic to handle rate limiting by the Twitter API, such as waiting and retrying after a certain period.
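
The "wait and retry" idea can be sketched as a small helper with exponential backoff. This is a generic illustration, not part of Tweepy: `RateLimitError` and `call_with_retry` are hypothetical names, and the delays are arbitrary. With Tweepy itself, passing `wait_on_rate_limit=True` when creating the API object already covers the common case.

```python
import time

class RateLimitError(Exception):
    """Stand-in for an API's 'too many requests' error (hypothetical)."""

def call_with_retry(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on RateLimitError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)

# Usage with a fake fetch function that is rate-limited twice before succeeding
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return ["tweet"]

delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky_fetch, sleep=delays.append)
```

Injecting `sleep` as a parameter keeps the helper easy to test; in real use you would leave it at the default `time.sleep` and catch whatever exception your client library raises for HTTP 429.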

## 2.4 Store Collected Data
- **Temporary Storage**: Initially, you might store the tweets in a temporary data structure like a list or a Pandas DataFrame.
- **Database Storage**: For long-term storage and retrieval, consider saving the tweets to a database. Choose between SQL or NoSQL based on your preference and the data's structure.

## 2.5 Error Handling
- **API Errors**: Implement error handling for issues like network errors, API rate limits, or invalid responses.
- **Data Quality Checks**: Put checks in place to ensure the quality of the data collected (e.g., filtering out irrelevant or spammy tweets).
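
As one possible shape for the data quality checks, the sketch below drops duplicates, tweets that never mention the query, and tweets that look like hashtag or link spam. The function names and every threshold are assumptions chosen for illustration; tune them on your own data.

```python
import re

def is_low_quality(text, query, max_hashtags=5, max_urls=2, min_words=3):
    """Heuristic check; all thresholds are illustrative assumptions."""
    if query.lower() not in text.lower():
        return True  # irrelevant: the query term never appears in the text
    if len(re.findall(r"#\w+", text)) > max_hashtags:
        return True  # hashtag stuffing
    if len(re.findall(r"https?://\S+", text)) > max_urls:
        return True  # link spam
    # Count the words that remain once URLs, mentions and hashtags are removed
    words = re.sub(r"https?://\S+|[@#]\w+", " ", text).split()
    return len(words) < min_words

def filter_tweets(tweets, query):
    """Drop low-quality tweets and case-insensitive duplicates, keeping order."""
    seen = set()
    kept = []
    for text in tweets:
        key = " ".join(text.lower().split())  # normalise case and whitespace
        if key in seen or is_low_quality(text, query):
            continue
        seen.add(key)
        kept.append(text)
    return kept

sample = [
    "ChatGPT wrote my unit tests today",
    "chatgpt wrote my unit tests   today",                     # duplicate
    "Win a prize #a #b #c #d #e #f ChatGPT free stuff here",   # hashtag stuffing
    "Totally unrelated post about cooking",                    # irrelevant
    "ChatGPT http://x.co",                                     # too short
]
clean_sample = filter_tweets(sample, "ChatGPT")
```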

In [15]:
# Twitter API credentials
# Do not hard-code keys in a notebook: any key that has been shared or committed
# should be treated as leaked and regenerated in the developer portal.
# The environment variable names below are assumptions; use the names you export.
import os

api_key = os.environ['TWITTER_API_KEY']
api_secret_key = os.environ['TWITTER_API_SECRET_KEY']
bearer_token = os.environ['TWITTER_BEARER_TOKEN']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']
client_id = os.environ['TWITTER_CLIENT_ID']
client_id_secret = os.environ['TWITTER_CLIENT_SECRET']

# Authenticate
auth = tweepy.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

In [16]:
# Authentication (`auth`, `api`) is already done in the previous cell.
# Note: `wait_on_rate_limit_notify` was removed in Tweepy 4.0; passing it raises
# TypeError: API.__init__() got an unexpected keyword argument 'wait_on_rate_limit_notify'
# `tweepy.StreamListener` was removed in the same release, so this cell uses the
# v2 `tweepy.StreamingClient` instead (assumes Tweepy >= 4.6 and an access level
# that includes the filtered stream endpoint).

# Define a class to listen to tweets
class MyStreamListener(tweepy.StreamingClient):
    def on_tweet(self, tweet):
        # Extract info from tweets
        text = tweet.text
        # analyze_sentiment is not defined in this notebook; define it before running
        sentiment_score = analyze_sentiment(text)
        # Process and store the tweet and sentiment score

# Instantiate the stream listener
myStream = MyStreamListener(bearer_token, wait_on_rate_limit=True)

# Start the stream
# Stream tweets containing keywords like
# "artificial intelligence", "augmented reality", "blockchain".
# Retweets are excluded by the rule itself (-is:retweet).
myStream.add_rules(tweepy.StreamRule(
    '("artificial intelligence" OR "augmented reality" OR blockchain) -is:retweet'))
myStream.filter()

In [13]:
def search_tweets(query, max_tweets):
    tweets = tweepy.Cursor(api.search_tweets, q=query, lang="en", tweet_mode='extended').items(max_tweets)
    
    tweet_list = []
    for tweet in tweets:
        tweet_list.append(tweet.full_text)

    return tweet_list

product_name = "ChatGPT"
max_tweets = 50
tweets_about_product = search_tweets(product_name, max_tweets)

# Print the fetched tweets
for tweet in tweets_about_product:
    print(tweet)
    store_tweet(tweet, product_name)

Forbidden: 403 Forbidden
453 - You currently have access to a subset of Twitter API v2 endpoints and limited v1.1 endpoints (e.g. media post, oauth) only. If you need access to this endpoint, you may need a different access level. You can learn more here: https://developer.twitter.com/en/portal/product

In [None]:
# Storing the tweets in a database

def create_database():
    # Connect to SQLite database (it will be created if it doesn't exist)
    conn = sqlite3.connect('twitter_data.db')

    # Create a new SQLite table with columns for different tweet attributes
    conn.execute('''CREATE TABLE IF NOT EXISTS tweets
                 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                  tweet_text TEXT,
                  query TEXT,
                  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')
    
    # Commit changes and close the connection
    conn.commit()
    conn.close()

create_database()

def store_tweet(tweet_text, query):
    conn = sqlite3.connect('twitter_data.db')
    cur = conn.cursor()

    # Insert a new row of data
    cur.execute("INSERT INTO tweets (tweet_text, query) VALUES (?, ?)", (tweet_text, query))

    # Commit changes and close the connection
    conn.commit()
    conn.close()
    
# For fetching the data in the database later
def get_tweets_by_query(query):
    conn = sqlite3.connect('twitter_data.db')
    cur = conn.cursor()

    # Select tweets that match the query
    cur.execute("SELECT tweet_text FROM tweets WHERE query=?", (query,))
    all_tweets = cur.fetchall()

    conn.close()
    return all_tweets


# Step 3: Data Processing
## 3.1 Clean and Preprocess Data
Implement functions to clean tweets (removing URLs, mentions, hashtags, and special characters) and to normalize the text (converting to lowercase, removing punctuation).

- **Remove URLs**: URLs in tweets can be removed as they usually don't contribute to sentiment analysis.
- **Remove Mentions and Hashtags**: Mentions (@usernames) and hashtags (#hashtag) can be removed or kept, depending on your analysis requirements.
- **Remove Special Characters and Numbers**: Special characters and numbers often don't contribute to sentiment analysis and can be removed.
- **Convert to Lowercase**: Convert all text to lowercase for consistency.

## 3.2 Data Storage
- Decide how you'll store the fetched tweets (e.g., in a database or files).
- Implement the storage mechanism in your script.

In [None]:
import re

def clean_tweet(tweet):
    tweet = re.sub(r'http\S+', '', tweet)  # Remove URLs
    tweet = re.sub(r'@\S+', '', tweet)  # Remove mentions
    tweet = re.sub(r'#\S+', '', tweet)  # Remove hashtags
    tweet = re.sub(r'[^A-Za-z\s]', '', tweet)  # Remove special characters and numbers
    tweet = tweet.lower()  # Convert to lowercase
    return tweet

# tweets_about_product is a list of strings, so clean each tweet individually
cleaned_tweets = [clean_tweet(tweet) for tweet in tweets_about_product]

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemmatize_tweet(tweet):
    words = tweet.split()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

def remove_stopwords(tweet):
    words = tweet.split()
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)

def preprocess_tweet(tweet):
    tweet = clean_tweet(tweet)
    tweet = lemmatize_tweet(tweet)
    tweet = remove_stopwords(tweet)
    return tweet

# Step 4: Data Aggregation for Time Series
**Aggregate Data:** Aggregate the data by your chosen time interval (e.g., daily): count the tweets and calculate the average sentiment score for each interval.

**Exploratory Data Analysis:**
- **Trend Analysis:** Visualize the volume of tweets and average sentiment over time to identify patterns, trends, and anomalies.
- **Correlation Analysis:** Optionally, check whether tweet volume or sentiment correlates with external events or announcements in the tech world.

**Time-Series Forecasting:**
- **Model Selection:** Choose a suitable model for time-series forecasting. ARIMA, SARIMA, and LSTM neural networks are common choices.
- **Feature Engineering:** Include relevant time-based features and any other features that might improve the model.
- **Model Training:** Train your model on the historical data.

**Model Evaluation:**
- **Validation:** Validate your model on a separate test set.
- **Performance Metrics:** Use metrics like MAE, RMSE, or others relevant for time series to evaluate the model's performance.

In [None]:
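# Load the stored tweets from SQLite into a DataFrame.
# Note: created_at is the time the row was inserted, not the time the tweet was posted.
conn = sqlite3.connect('twitter_data.db')
df = pd.read_sql_query(
    "SELECT tweet_text AS text, created_at AS timestamp FROM tweets WHERE query = ?",
    conn, params=(product_name,))
conn.close()

# Score each tweet; analyze_sentiment is assumed to be defined elsewhere
# (it is not defined in this notebook)
df['sentiment'] = df['text'].apply(analyze_sentiment)

# Convert timestamp to datetime and set as index
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

# Resample and aggregate
aggregated_df = df.resample('D').agg({'text': 'count', 'sentiment': 'mean'}) # 'D' for daily
aggregated_df.rename(columns={'text': 'tweet_count', 'sentiment': 'average_sentiment'}, inplace=True)

# Now, aggregated_df contains daily tweet counts and average sentiment
print(aggregated_df.head())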

In [None]:
# Fill missing values with the previous day's data (forward fill).
# Days without tweets get a count of 0 but a NaN mean sentiment, so the forward
# fill mainly carries the previous day's sentiment forward.

# Check for missing values
print("Missing values before forward fill:")
print(aggregated_df.isnull().sum())

# Apply forward fill
aggregated_df.ffill(inplace=True)

# Check to ensure missing values are filled
print("Missing values after forward fill:")
print(aggregated_df.isnull().sum())

In [None]:
# Using ARIMA (Example)
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

# aggregated_df is the daily DataFrame built above; 'tweet_count' is the column to model
# Check for stationarity (a p-value below 0.05 suggests the series is stationary)
result = adfuller(aggregated_df['tweet_count'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])

# Fit ARIMA model (example parameters)
model = ARIMA(aggregated_df['tweet_count'], order=(1,1,1))
model_fit = model.fit()
print(model_fit.summary())


In [None]:
# Using Vector Autoregression (Example)
from statsmodels.tsa.api import VAR

# VAR needs numeric columns only, so use the daily aggregate (tweet count and
# mean sentiment) rather than the raw tweet DataFrame
model = VAR(aggregated_df)
model_fit = model.fit(maxlags=15, ic='aic')
print(model_fit.summary())


# Step 5: Exploratory Data Analysis (EDA)
**Trend Analysis:** Visualize the volume of tweets and average sentiment over time to identify patterns, trends, and anomalies.

**Correlation Analysis:** Optionally, check if there's any correlation between the volume of tweets or sentiment with external events or announcements in the tech world.
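
One simple way to start on the external-event check is to compare average tweet volume on days near known events with the average on all other days. The sketch below is a descriptive comparison under stated assumptions, not a formal correlation test: `event_lift` is a hypothetical helper, the event dates would have to come from your own list of launches or announcements, and the counts shown are made up.

```python
from datetime import date, timedelta
from statistics import mean

def event_lift(daily_counts, event_dates, window_days=1):
    """Ratio of mean daily volume near events to mean volume on all other days.

    daily_counts: {date: tweet_count}; event_dates: iterable of dates.
    A day counts as "near" an event if it is within window_days of it.
    Returns None when one side has no data or the baseline is zero.
    """
    near = set()
    for event in event_dates:
        for offset in range(-window_days, window_days + 1):
            near.add(event + timedelta(days=offset))

    event_values = [c for d, c in daily_counts.items() if d in near]
    other_values = [c for d, c in daily_counts.items() if d not in near]
    if not event_values or not other_values:
        return None
    baseline = mean(other_values)
    if baseline == 0:
        return None
    return mean(event_values) / baseline

# Made-up daily counts for one week, with a hypothetical launch on 4 January
counts = [10, 10, 30, 50, 40, 10, 10]
daily_counts = {date(2024, 1, 1) + timedelta(days=i): c for i, c in enumerate(counts)}
lift = event_lift(daily_counts, [date(2024, 1, 4)])
```

A lift well above 1 suggests that discussion clusters around the events, but it says nothing about causation, and a handful of events is too few to draw firm conclusions.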

In [None]:
import matplotlib.pyplot as plt

# Assuming 'aggregated_df' has columns 'tweet_count' and 'average_sentiment'
aggregated_df.plot(subplots=True)
plt.show()

In [None]:
from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test
result = adfuller(aggregated_df['tweet_count'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(aggregated_df['tweet_count'])
plot_pacf(aggregated_df['tweet_count'])
plt.show()


In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(aggregated_df['tweet_count'], model='additive')
decomposition.plot()
plt.show()


In [None]:
import seaborn as sns

sns.heatmap(aggregated_df.corr(), annot=True)
plt.show()


# Step 6: Time-Series Forecasting
**Model Selection:** Choose a suitable model for time-series forecasting. ARIMA, SARIMA, and LSTM neural networks are common choices.

**Feature Engineering:** Include relevant time-based features and any other features that might improve the model.

**Model Training:** Train your model on the historical data.
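
As an illustration of time-based features, the sketch below derives calendar and lag features from a daily count series. The function name `build_features`, the choice of lags (1 and 7 days), and the sample series are assumptions. Plain ARIMA does not take such features directly; they would typically feed a regression-style model, or be passed as exogenous regressors to a model that accepts them (for example the `exog` argument of `SARIMAX`).

```python
from datetime import date, timedelta

def build_features(series, lags=(1, 7)):
    """Turn [(day, count), ...] (sorted, one row per day) into feature rows.

    Rows without enough history for the largest lag are skipped.
    """
    max_lag = max(lags)
    rows = []
    for i in range(max_lag, len(series)):
        day, count = series[i]
        row = {
            "day": day,
            "target": count,
            "day_of_week": day.weekday(),       # Monday = 0 ... Sunday = 6
            "is_weekend": day.weekday() >= 5,
        }
        for lag in lags:
            row[f"lag_{lag}"] = series[i - lag][1]  # count observed `lag` days earlier
        rows.append(row)
    return rows

# Made-up series: ten consecutive days starting on Monday, 1 January 2024
start = date(2024, 1, 1)
series = [(start + timedelta(days=i), 100 + i) for i in range(10)]
features = build_features(series)
```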

In [None]:
from statsmodels.tsa.arima.model import ARIMA

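# Chronological split: train on the first 80% of days, hold out the rest
# (the 80/20 ratio is an assumption; time series must not be shuffled)
split = int(len(aggregated_df) * 0.8)
train_data = aggregated_df['tweet_count'].iloc[:split]
test_data = aggregated_df['tweet_count'].iloc[split:]

# Example: ARIMA with parameters (1,1,1)
model = ARIMA(train_data, order=(1,1,1))
model_fit = model.fit()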

In [None]:
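from statsmodels.tsa.api import VAR

# VAR is multivariate, so train on all columns of aggregated_df;
# the last 20% of days are held out (the 80/20 split is an assumption)
var_train = aggregated_df.iloc[:int(len(aggregated_df) * 0.8)]
model = VAR(var_train)
model_fit = model.fit(ic='aic')  # 'aic' will choose the best lag order based on AIC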

# Step 7: Model Evaluation
**Validation:** Validate your model on a separate test set.

**Performance Metrics:** Use metrics like MAE, RMSE, or others relevant for time series to evaluate the model's performance.
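
The sketch below shows one way validation and metrics could fit together: an expanding-window (walk-forward) evaluation scored with MAE and RMSE. To keep it self-contained it uses a seasonal-naive baseline (repeat the value from 7 days earlier) rather than the ARIMA or VAR models above; the function names, the weekly season, and the sample series are all assumptions. Any model can be plugged in by passing a different `forecast` callable.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def seasonal_naive(history, horizon, season=7):
    """Baseline forecast: repeat the values observed one season earlier."""
    return [history[len(history) - season + (h % season)] for h in range(horizon)]

def walk_forward(series, initial_train, horizon, forecast):
    """Expanding-window evaluation: forecast `horizon` steps, then grow the training set."""
    actual, predicted = [], []
    for end in range(initial_train, len(series) - horizon + 1, horizon):
        train, test = series[:end], series[end:end + horizon]
        predicted.extend(forecast(train, horizon))
        actual.extend(test)
    return actual, predicted

# Made-up daily counts with a perfectly repeating weekly pattern (4 weeks)
series = [10, 12, 14, 16, 18, 30, 40] * 4
actual, predicted = walk_forward(series, initial_train=14, horizon=7, forecast=seasonal_naive)
baseline_mae = mae(actual, predicted)
baseline_rmse = rmse(actual, predicted)
```

A baseline like this gives the ARIMA or VAR scores something to be compared against: a model that cannot beat "same as last week" on MAE and RMSE is probably not adding much.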

# Step 8: Prediction and Visualization
**Make Predictions:** Use the model to make predictions about future trends in discussions on emerging technologies.

**Visualization:** Create visualizations to represent the predicted trends in an understandable and insightful way.