# PrepareData.ipynb

## Read data from downloaded csv files and create dataframes.

NOTE: Company_Tweet.csv and Tweet.csv are too large to upload to Git, because the account limits each file to 100 MB. They must be downloaded separately.

The steps below read the data and store it in dataframes:
* Before running this code, download Company_Tweet.csv and Tweet.csv from Kaggle (https://www.kaggle.com/code/saadusama/twitter-s-impact-on-stock-market-prices/data) and copy them into the Resources folder
* Read ticker data from Resources/CompanyValues.csv, filter the Tesla stock data and store it in a dataframe
* Read Twitter data from Company_Tweet.csv and Tweet.csv, filter the tweets for Tesla and store them in a dataframe
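
Because two of the input files have to be downloaded by hand, a quick check before running the notebook can save a confusing `FileNotFoundError` later. The following is a minimal sketch; the helper name `missing_resources` is hypothetical and not part of the original notebook, and the file names are the ones listed in the steps above.

```python
from pathlib import Path

# File names taken from the steps above.
REQUIRED_FILES = ["CompanyValues.csv", "Company_Tweet.csv", "Tweet.csv"]


def missing_resources(resources_dir="Resources"):
    """Return the required CSV files that are not present in resources_dir."""
    folder = Path(resources_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


missing = missing_resources()
if missing:
    print("Download these files into Resources/ first:", ", ".join(missing))
```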

### Necessary imports

In [None]:
import pandas as pd
from pathlib import Path
from datetime import datetime

### Read Ticker data
* Read stock data from Resources/CompanyValues.csv
* Filter dataframe to store only TSLA data
* Drop ticker_symbol column as it is not required anymore
* Set the index to day_date
* Review DataFrame

In [None]:
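# Load the stock values for all companies
market_df = pd.read_csv(Path("Resources/CompanyValues.csv"))
# Keep only the TSLA rows
tsla_stock_values_df = market_df[market_df["ticker_symbol"] == "TSLA"]
# The ticker column is constant after filtering, so drop it and index by date
tsla_stock_values_df = tsla_stock_values_df.drop(columns=["ticker_symbol"])
tsla_stock_values_df = tsla_stock_values_df.set_index("day_date")
tsla_stock_values_df.head()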
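
The cell above leaves day_date as plain strings. If the stock data is later aligned with tweet dates, a datetime index may be more convenient. The sketch below shows one possible variant on a tiny in-memory sample. The rows and the `close_value` column are made up for illustration; only `ticker_symbol` and `day_date` are taken from the notebook.

```python
import io

import pandas as pd

# Made-up sample rows standing in for Resources/CompanyValues.csv.
sample_csv = io.StringIO(
    "ticker_symbol,day_date,close_value\n"
    "TSLA,2020-01-03,101.5\n"
    "AAPL,2020-01-02,50.0\n"
    "TSLA,2020-01-02,100.0\n"
)

# parse_dates converts day_date to datetime while reading.
market_df = pd.read_csv(sample_csv, parse_dates=["day_date"])

tsla_df = (
    market_df[market_df["ticker_symbol"] == "TSLA"]
    .drop(columns=["ticker_symbol"])
    .set_index("day_date")
    .sort_index()  # chronological order, useful for later date-based slicing
)
print(tsla_df)
```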

### Read Twitter Data and prepare one DataFrame for TSLA tweets
* Read Tweets from Resources/Tweet.csv and review dataframe
* Read Resources/Company_Tweet.csv, to find tweets relevant for TSLA, and review dataframe
* Merge both dataframes on tweet_id to get the consolidated tweet data for TSLA
* Review merged dataframe
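* Add a total_engagement column (comments + retweets + likes) and keep only tweets with a total engagement above 2
* Convert post_date from a Unix timestamp to datetime format
* Save the resulting dataframe to Resources/tsla_tweets.csv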

In [None]:
tweets_df = pd.read_csv(Path("Resources/Tweet.csv"))
display(tweets_df.head())
display(tweets_df.shape)

In [None]:
company_tweets_df = pd.read_csv(Path("Resources/Company_Tweet.csv"))
display(company_tweets_df.head())
display(company_tweets_df.shape)

In [None]:
tsla_tweets_df = company_tweets_df[company_tweets_df["ticker_symbol"] == "TSLA"]
display(tsla_tweets_df.head())
display(tsla_tweets_df.shape)

In [None]:
tsla_tweets_df = pd.merge(tsla_tweets_df,tweets_df,on="tweet_id")
display(tsla_tweets_df.head())
display(tsla_tweets_df.tail())
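
The merge above is an inner join on tweet_id, so any TSLA tweet id without a matching row in Tweet.csv is silently dropped. The sketch below, on made-up toy frames, shows how the `validate` and `indicator` parameters of `pd.merge` could be used to check the join. Whether tweet_id is really unique in Tweet.csv is an assumption here, not something the notebook confirms.

```python
import pandas as pd

# Toy stand-ins for Company_Tweet.csv and Tweet.csv (values are made up).
company_tweets = pd.DataFrame(
    {"tweet_id": [1, 2, 3], "ticker_symbol": ["TSLA", "AAPL", "TSLA"]}
)
tweets = pd.DataFrame({"tweet_id": [1, 2, 4], "like_num": [5, 0, 7]})

tsla_ids = company_tweets[company_tweets["ticker_symbol"] == "TSLA"]

# validate="many_to_one" raises MergeError if tweet_id repeats in `tweets`.
merged = pd.merge(tsla_ids, tweets, on="tweet_id", how="inner", validate="many_to_one")

# A left join with indicator=True reveals TSLA ids that have no tweet row.
check = pd.merge(tsla_ids, tweets, on="tweet_id", how="left", indicator=True)
unmatched = int((check["_merge"] == "left_only").sum())

print(merged)
print("TSLA ids without tweet data:", unmatched)
```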

In [None]:
tsla_tweets_df.info()
tsla_tweets_df.shape

In [None]:
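# tsla_tweets_df = tsla_tweets_df[tsla_tweets_df["retweet_num"] > 0]
# Total engagement is the sum of comments, retweets and likes
tsla_tweets_df["total_engagement"] = tsla_tweets_df["comment_num"] + tsla_tweets_df["retweet_num"] + tsla_tweets_df["like_num"]
# Keep tweets with more than 2 interactions; .copy() avoids a SettingWithCopyWarning on later column assignments
tsla_tweets_df = tsla_tweets_df[tsla_tweets_df["total_engagement"] > 2].copy()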
tsla_tweets_df.shape
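
To make the effect of the engagement threshold concrete, here is a small sketch on made-up counts. It uses `sum(axis=1)` instead of chained `+`, which is a swapped-in variant rather than the notebook's method: it treats a missing count as 0 instead of producing NaN for the whole row.

```python
import pandas as pd

# Made-up engagement counts for three tweets.
tweets = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "comment_num": [0, 1, 2],
        "retweet_num": [1, 1, 0],
        "like_num": [1, 1, 5],
    }
)

engagement_cols = ["comment_num", "retweet_num", "like_num"]
tweets["total_engagement"] = tweets[engagement_cols].sum(axis=1)

# Same threshold as the notebook: strictly more than 2 interactions.
engaged = tweets[tweets["total_engagement"] > 2].copy()
print(engaged)
```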

In [None]:
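# post_date holds Unix timestamps in seconds; fromtimestamp converts them to
# naive datetimes in the local time zone of the machine running the notebook
tsla_tweets_df["post_date"] = tsla_tweets_df["post_date"].apply(datetime.fromtimestamp)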
tsla_tweets_df.head()
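
`datetime.fromtimestamp` depends on the local time zone, so the same notebook can produce different datetimes on different machines. A vectorized, time-zone-explicit alternative is sketched below on sample epoch values. It assumes, as the cell above does, that post_date is stored in seconds since the Unix epoch.

```python
import pandas as pd

# Sample epoch values in seconds (1577836800 is 2020-01-01 00:00:00 UTC).
epochs = pd.Series([0, 86400, 1577836800])

# unit="s" interprets the integers as seconds; utc=True makes the zone explicit.
as_utc = pd.to_datetime(epochs, unit="s", utc=True)

# Calendar dates, e.g. for joining against daily stock values.
dates = as_utc.dt.date
print(as_utc)
```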

In [None]:
tsla_tweets_df.to_csv("Resources/tsla_tweets.csv")
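
A CSV file does not preserve column types, so post_date comes back as text unless it is parsed again when the file is read. The round trip below is a minimal sketch using a temporary file and made-up rows. Passing `index=False` is a choice made for this sketch; the cell above writes the index as an extra column.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows standing in for the TSLA tweets dataframe.
df = pd.DataFrame(
    {
        "tweet_id": [1, 2],
        "post_date": pd.to_datetime(["2020-01-01 09:30:00", "2020-01-02 16:00:00"]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "tsla_tweets.csv"
    df.to_csv(csv_path, index=False)  # skip the index column
    # parse_dates restores the datetime dtype that CSV cannot store
    restored = pd.read_csv(csv_path, parse_dates=["post_date"])

print(restored.dtypes)
```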

### Store dataframes in IPython's Database
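Storing the dataframes lets other notebooks reuse them without repeating the code above.
* `%store` stores variables, aliases and macros in IPython’s database.
* Store the TSLA tweets dataframe and the TSLA stock values dataframe in IPython's database
* In another notebook, `%store -r tsla_tweets_df tsla_stock_values_df` restores them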

In [None]:
%store tsla_tweets_df
%store tsla_stock_values_df