# TikTok Political Analysis
## ~Objectives
### Problems & Questions
__How can we better develop educational materials to meet kids where they are?__
- is it worth it to spend money to advertise to youth for political campaigns - are they engaging with current events?
- what are kids talking about & why? What does our education system tell them and not tell them

### Goals
- understanding how age/youth impacts political indoctrination
- understanding social impacts of political events
- to understand colloquial knowledge of political concepts

## ~Scope
- daily batch updates
- parsed news events triggers TikTok & twitter queries 
- topic counts 3 days before event cumulatively added to event day & 3 days following event
- see trend lines of engagement on Twitter & TikTok

### Overview:
- Use NewsAPI to find top news by day
- Parse news story title & article into individual words/phrases
- Count most important individual words & phrases
- Use top 3 most important words & phrases to create rules for searching the Twitter API
- Count number of tweets mentioning words & phrases filtered by rules
- Use top 3 words & phrases to find similar tags on TikTok API
- Count number of TikTok challenges/tags/captions with top words & phrases

## ~Extras
- age inference of users
- sentiment analysis (TextBlob)

---

# 1. Install Dependencies & Import Modules
- Newsapi-python: pip install newsapi-python
- Tweepy (install without virtual environment): pip install tweepy
- playwright: pip install playwright
                playwright install
- TikTokAPI (install without virtual environment): pip install PyTikTokAPI


In [8]:
import pandas as pd
import json
import requests
from datetime import date, timedelta
from bs4 import BeautifulSoup
import numpy as np
import logging
import configparser
from timer import Timer
from numpy import datetime64
from datetime import date, datetime, timedelta
from nltk import tokenize
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from operator import itemgetter
import math
import tweepy  # python package for accessing Tweet streaming API
from tweepy import API
from tweepy import Stream
import urllib.parse
import psycopg2  # alts: SQLalchemy - warning: not as simple
from TikTokAPI import TikTokAPI
from selenium import webdriver
from psycopg2 import Error
import re
import sys
import geocoder
from helper_functions import *


Configure using config.ini file

In [9]:
c = configparser.ConfigParser()
c.read('config.ini')

# config credentials
host = c['database']['host']
username = c['database']['user']
password = c['database']['password']
db = c['database']['database']

news_api_key = c['newsAuth']['api_key']
tiktok_sv_id = c['tiktokAuth']['s_v_web_id']
tiktok_tt_id = c['tiktokAuth']['tt_webid']
# twitter_api_key = c['twitterAuth']['api_key']

access_token = c['twitterAuth']['access_token']
access_token_secret = c['twitterAuth']['access_token_secret']
consumer_key = c['twitterAuth']['consumer_key']
consumer_secret = c['twitterAuth']['consumer_secret']


create Database class

In [10]:
class DataBase():
    def __init__(self, host_name, user_name, user_password):
        self.host_name = host_name
        self.user_name = user_name
        self.user_password = user_password

    def create_server_connection(self):
        self.connection = None
        try:
            self.connection = psycopg2.connect(
                host=self.host_name,
                user=self.user_name,
                password=self.user_password
            )
            logging.info("Database connection successful")
        except Error as err:
            logging.error(f"Error: '{err}'")

        return self.connection


    def create_database(self, connection, query):
            self.connection = connection
            cursor = connection.cursor()
            try:
                cursor.execute(query)
                logging.info("Database created successfully")
            except Error as err:
                logging.error(f"Error: '{err}'")


    def create_db_connection(self, db_name):
            self.db_name = db_name
            self.connection = None
            try:
                self.connection = psycopg2.connect(
                    host=self.host_name,
                    user=self.user_name,
                    password=self.user_password,
                    database=self.db_name
                )
                # cursor = connection.cursor()
                logging.info("PostgreSQL Database connection successful")
            except psycopg2.Error as err:
                logging.error(f"Error: '{err}'")

            return self.connection

    # @Timer(name='Query Execution') #*TODO fix __enter__ attribute error
    def execute_query(self, connection, query):
            self.connection = connection
            cursor = connection.cursor()
            try:
                cursor.execute(query)
                self.connection.commit()
                logging.info("Query successful")
            except Error as err:
                print(f"Error: '{err}'")
    
    def read_query(self, connection, query):
        self.connection = connection
        cursor = self.connection.cursor()
        result = None
        try:
            cursor.execute(query)
            result = cursor.fetchall()
            return result
        except Error as err:
            logging.error(f"Error: '{err}'")
    

    # @Timer(name='Mogrify')
    def execute_mogrify(self, conn, df, table):
        """
        Using cursor.mogrify() to build the bulk insert query
        then cursor.execute() to execute the query
        """
        self.connection = conn
        # Create a list of tupples from the dataframe values
        tuples = [tuple(x) for x in df.to_numpy()]
    
        # Comma-separated dataframe columns
        cols = ','.join(list(df.columns))
    
        # SQL query to execute
        cursor = conn.cursor()
        values = [cursor.mogrify("(%s,%s,%s,%s)", tup).decode('utf8')
                for tup in tuples]
        # if not publishedAt, delete record
        query = "INSERT INTO %s(%s) VALUES" % (table, cols) + ",".join(values)

        try:
            cursor.execute(query, tuples)
            conn.commit()
        except (Exception, psycopg2.DatabaseError) as error:
            logging.error("Error: %s" % error)
            print("Error: %s" % error)
            conn.rollback()
            cursor.close()
            conn.close()
            return 1
        logging.info("execute_mogrify() done")
        cursor.close()
        conn.close()


Variables for SQL queries

In [None]:
# create db 
create_database_query = """
        CREATE DATABASE IF NOT EXISTS sm_news;
    """
# create necessary tables
create_article_table = """
    CREATE TABLE IF NOT EXISTS articles (
        publishedAt DATE,
        title VARCHAR PRIMARY KEY,
        author VARCHAR,
        url TEXT
        );
    """
create_article_table_index = """
    CREATE INDEX index
        ON articles(publishedAt,
            title
        );
    """
create_article_text_table = """
    CREATE TABLE IF NOT EXISTS article_text (
        title VARCHAR PRIMARY KEY,
        article_text TEXT
        );
    """
create_political_event_table = """
    CREATE TABLE IF NOT EXISTS event (
        eventID ID PRIMARY KEY,
        startDate DATE,
        name VARCHAR NOT NULL,
        description VARCHAR NOT NULL,
        keyWords VARCHAR
        );
 """
create_tweets_table = """
    CREATE TABLE IF NOT EXISTS tweets (
        tweet_id INT PRIMARY KEY,
        publishedAt DATE NOT NULL,
        userID VARCHAR NOT NULL,
        tweet VARCHAR NOT NULL,
        location VARCHAR NOT NULL, 
        tags VARCHAR NOT NULL
        );
    """
create_tiktoks_table = """
    CREATE TABLE IF NOT EXISTS tiktoks (
        postID INT PRIMARY KEY,
        createTime DATE NOT NULL,
        userID INT NOT NULL,
        description VARCHAR NOT NULL,
        musicID INT NOT NULL,
        soundID INT NOT NULL,
        tags VARCHAR NOT NULL
        );
    """
create_tiktok_sounds_table = """
    CREATE TABLE IF NOT EXISTS tiktok_sounds (
        soundID INT PRIMARY KEY,
        soundTitle VARCHAR,
        isOriginal BOOLEAN
        );
    """
create_tiktok_music_table = """
    CREATE TABLE IF NOT EXISTS tiktok_music (
        songID INT PRIMARY KEY,
        songTitle VARCHAR NOT NULL
        );
    """

create_tiktok_stats_table = """
    CREATE TABLE IF NOT EXISTS tiktok_stats (
        postID INT PRIMARY KEY,
        shareCount INT,
        commentCount INT,
        playCount INT,
        diggCount INT
        );
    """

create_tiktok_tags_table = """
    CREATE TABLE IF NOT EXISTS tiktok_tags (
        tagID INT PRIMARY KEY,
        tag_name VARCHAR NOT NULL 
        );
    """
create_users_table = """
    CREATE TABLE IF NOT EXISTS users (
        userID INT PRIMARY KEY,
        username VARCHAR NOT NULL,
        user_bio VARCHAR NOT NULL
        );
    """
delete_bad_data = """
    DELETE FROM articles
        WHERE publishedAt IS NULL;
    """


Create Database

In [11]:
postgres_db = DataBase(host, username, password)

# connect to server
postgres_server = postgres_db.create_server_connection()

# connect to social media news db
connection = postgres_db.create_db_connection(db)




In [None]:
# execute defined queries to create db tables if needed
try:
    # TODO fix attribute error __enter__ for Timer wrapper
    postgres_db.execute_query(connection, create_article_table)
    postgres_db.execute_query(connection, create_article_text_table)
    postgres_db.execute_query(connection, create_tweets_table)
    postgres_db.execute_query(connection, create_political_event_table)
    postgres_db.execute_query(connection, create_users_table)
    postgres_db.execute_query(connection, create_tiktok_sounds_table)
    postgres_db.execute_query(connection, create_tiktok_music_table)
    postgres_db.execute_query(connection, create_tiktok_stats_table)
    postgres_db.execute_query(connection, create_tiktok_tags_table)
    postgres_db.execute_query(connection, create_tiktoks_table)
except (ConnectionError) as e:
    logging.error({e}, 'Check SQL create queries')


In [None]:
# add foreign keys
alter_tiktoks_table = """
    ALTER TABLE tiktoks
    ADD FOREIGN KEY(musicID) REFERENCES tiktok_music(songID),
    ADD FOREIGN KEY(soundID) REFERENCES tiktok_sounds(soundID),
    ADD FOREIGN KEY(userID) REFERENCES users(userID)
    ON DELETE SET NULL;
"""
alter_tiktok_stats_table = """
    ALTER TABLE tiktok_stats
    ADD FOREIGN KEY(postID) REFERENCES tiktoks(postID)
    ON DELETE SET NULL;
"""
try:
    postgres_db.execute_query(connection, alter_tiktoks_table)
    postgres_db.execute_query(connection, alter_tiktok_stats_table)
except (ConnectionError) as e:
    logging.error({e}, 'Check SQL alteration queries. Are the foreign key restraints valid?')


# 2. Find Top News by Day

Create News class

In [12]:
class News():
    """Extract keywords from  news articles to use as search values for TikTok & Twitter posts relating to the political event of interest. """

    def __init__(self, api_key, logger=logging):
        self.api_key = api_key
        self.logger = logging.basicConfig(filename='news.log', filemode='w',
                    format=f'%(asctime)s - %(levelname)s - %(message)s')

    def request_pop_news(self, params={
        'q': ['politics' or 'political' or 'law' or 'legal' or 'policy'],
        'from': {date.today() - timedelta(days=3)},
        'to': {date.today},
        'language': 'en',
        'sort_by': 'popularity'
    }):
        pop_news = []
        self.params = params

        headers = {
            'X-Api-Key': self.api_key,
            # get_random_ua for Chrome
            'user-agent': get_random_ua('Chrome')
        }

        url = 'https://newsapi.org/v2/everything'

        try:
            # response as JSON dict
            self.response = requests.get(
                url, params=self.params, headers=headers).json()  # backoff_factor=1, verify=False
        except requests.ConnectionError as error:
            logging.error(f'Connection error: {error}\n Likely too many requests. Try using a proxy or changing user agents.')
        else:
            with open('pop_news.json', 'w') as f:
                # write results to JSON file
                json.dump(self.response, f)

            with open('pop_news.json', 'r') as file:
                # create Python list object from JSON
                pop_news_json = file.read().split("\n")

                for story in pop_news_json:
                    pop_obj = json.loads(story)

                    if 'title' in pop_obj:
                        pop_obj['title'] = pop_obj['articles']['title']
                    if 'author' in pop_obj:
                        pop_obj['author'] = pop_obj['articles']['author']
                    if 'url' in pop_obj:
                        pop_obj['url'] = pop_obj['articles']['url']
                    if 'publishedAt' in pop_obj:
                        pop_obj['publishedAt'] = pop_obj['articles']['publishedAt']

                    # add info to pop_news dict
                    pop_news.append(pop_obj)
                    # self.news_counter += 1
            logging.info('Pop news request successful')
        
        return pop_news


    def get_top_headlines(self, params={
        "language": "en",
        "country": "us"
    }):
        top_headlines = []
        self.params = params
        # self.news_counter = 0

        headers = {
            "X-Api-Key": news_api_key,
            "user-agent": get_random_ua('Chrome')
            }
        url = "https://newsapi.org/v2/top-headlines"

        try:  # backoff_factor = 1 so successive sleep between failed requests are 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256
            self.response = requests.get(
                    url, params=self.params, headers=headers).json()  # response JSON dict
        except requests.exceptions.ConnectionError:
            self.response.status_code = "Connection refused"
                # break
        else:
            with open("top_headlines.json", "w") as f:
                    # write results to JSON file
                json.dump(self.response, f)

            with open("top_headlines.json", "r") as file:
                    # create Python object from JSON
                    top_headlines_json = file.read().split("\n")

                    for story in top_headlines_json:
                        story_obj = json.loads(story)

                        if 'title' in story_obj:
                            story_obj["title"] = story_obj["articles"]["title"]
                        if 'author' in story_obj:
                            story_obj["author"] = story_obj["articles"]["author"]
                        if 'url' in story_obj:
                            story_obj["url"] = story_obj["articles"]["url"]
                        if 'publishedAt' in story_obj:
                            story_obj["publishedAt"] = story_obj["articles"]["publishedAt"]

                        # add info to top_headlines list/dict
                        top_headlines.append(story_obj)
                        # self.news_counter += 1
            logging.info('News top headlines request successful')
        
            return top_headlines


    def get_all_news(self):
        """Combines top headlines and popular news into one Pandas DataFrame."""

        # call .get_top_headlines() and .request_pop_news()
        top_headlines = self.get_top_headlines()
        pop_news = self.request_pop_news()
        logging.info('News requests successful')

        # noramlize nested JSON results
        pop_news = pd.json_normalize(pop_news, record_path=['articles'])
        top_headlines = pd.json_normalize(
            top_headlines, record_path=['articles'])
        all_news = top_headlines.append(pop_news)

        # create dataframe from combined news list
        self.all_news_df = pd.DataFrame(
                all_news, columns=['title', 'author', 'url', 'publishedAt', "text"])
        self.all_news_df.drop_duplicates()

        # convert to datetime
        self.all_news_df['publishedAt'] = self.all_news_df['publishedAt'].map(
                lambda row: datetime.strptime(str(row), "%Y-%m-%dT%H:%M:%SZ") if pd.notnull(row) else row)

        # set index to publishing time, inplace to apply to same df instead of copy or view
        self.all_news_df.set_index('publishedAt', inplace=True)

        # apply .get_article_text() to text column of df
        self.all_news_df["text"] = self.all_news_df["url"].apply(
                self.get_article_text)

        # get keywords from article text
        self.all_news_df["keywords"] = self.all_news_df['text'].apply(
                self.keyword_extraction)
        # get top n=3 words of significance
        self.all_news_df["keywords"] = self.all_news_df["keywords"].apply(
                self.get_top_n, n=3)

        # print(len(all_news))
        logging.info(f'Stored {len(all_news)} news stories in dataframe')
        
        return self.all_news_df

    
    def get_article_text(self, url):
        """Clean & process news article text to prepare for keyword extraction"""

        contractions_dict = {"'s": " is", "n't": " not", "'m": " am", "'ll": " will",
                             "'d": " would", "'ve": " have", "'re": " are"}
        symbols_list = ['&', '+', '-', '/', '|', '$', '%', ':', '(', ')', '?', "'", ';', ',']

        try:
            # request
            r = requests.get(url)
            html = r.text
            soup = BeautifulSoup(html)
            a_text = soup.get_text()
        except requests.RequestException as ex:
            logging.exception(ex, 'Issue with article text requests')
        else:
            # remove newline characters
            a_text = a_text.strip()
            # split joined words
            a_text = " ".join([s for s in re.split(
                "([A-Z][a-z]+[^A-Z]*)", a_text) if s])
            # remove mentions
            a_text = re.sub("@\S+", " ", a_text)
            # remove URLs
            a_text = re.sub("https*\S+", " ", a_text)
            # remove hashtags
            a_text = re.sub("#\S+", " ", a_text)
            # remove unicode characters
            a_text = a_text.encode('ascii', 'ignore').decode()
            # replace contractions
            for key, value in contractions_dict.items():
                if key in a_text:
                    a_text = a_text.replace(key, value)
            # remove symbols and punctuation
            for i in symbols_list:
                if i in a_text:
                    a_text = a_text.replace(i, '')

            # make lowercase
            a_text = a_text.lower()
            a_text = re.sub(r'\w*\d+\w*', '', a_text)
        
        return a_text


    def keyword_extraction(self, text):
        """Determine weight of important words in articles and add to articles_text table
        using TF-IDF ranking"""

        # make sure text is in string format for parsing
        text = str(text)
        stop_words = set(stopwords.words('english'))

        # find total words in document for calculating Term Frequency (TF)
        total_words = text.split()
        total_word_length = len(total_words)

        # find total number of sentences for calculating Inverse Document Frequency
        total_sentences = tokenize.sent_tokenize(text)
        total_sent_len = len(total_sentences)

        # calculate TF for each word
        tf_score = {}
        for each_word in total_words:
            each_word = each_word.replace('.', '')
            if each_word not in stop_words and len(each_word) > 3:
                if each_word in tf_score:
                    tf_score[each_word] += 1
                else:
                    tf_score[each_word] = 1

        # Divide by total_word_length for each dictionary element
        tf_score.update((x, y/int(total_word_length))
                        for x, y in tf_score.items())  # TODO test - ZeroError

        #calculate IDF for each word
        idf_score = {}
        for each_word in total_words:
            each_word = each_word.replace('.', '')
            if each_word not in stop_words and len(each_word) > 3:
                if each_word in idf_score:
                    idf_score[each_word] = self.check_sent(each_word, total_sentences)
                else:
                    idf_score[each_word] = 1

        # Performing a log and divide
        idf_score.update((x, math.log(int(total_sent_len)/y))
                        for x, y in idf_score.items())

        # Calculate IDF * TF for each word
        tf_idf_score = {key: tf_score[key] *
                        idf_score.get(key, 0) for key in tf_score.keys()}

        return tf_idf_score

    def check_sent(self, word, sentences):
        """Check if word is present in sentence list for calculating IDF (Inverse Document Frequency)"""
        final = [all([w in x for w in word]) for x in sentences]
        sent_len = [sentences[i] for i in range(0, len(final)) if final[i]]
    
        return int(len(sent_len))

    def get_top_n(self, dict_elem, n):
        """Calculate most important keywords in text of interest"""
        result = dict(sorted(dict_elem.items(),
                     key=itemgetter(1), reverse=True)[:n])
        result = result.keys()

        return result



# 3. Parse Titles & Articles

In [13]:
# instantiate News class
news = News(news_api_key)
# get all news - takes about 30 seconds
news.get_all_news()

Unnamed: 0_level_0,title,author,url,text,keywords
publishedAt,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2021-09-01 21:57:13,"Joe Rogan, A Podcasting Giant Who Has Been Dis...",Alyssa Lukpat,https://www.nytimes.com/2021/09/01/business/jo...,joe rogan a podcasting giant who has be...,"(sections, search, content)"
2021-09-01 21:13:20,Social Security costs to exceed revenue for 1s...,Danielle DuClos,https://abcnews.go.com/Politics/social-securit...,social security costs to exceed revenue for ...,"(majority, security, projected)"
2021-09-01 20:50:36,Syed Ali Geelani: Kashmir's separatist patriar...,https://www.facebook.com/bbcnews,https://www.bbc.com/news/world-asia-india-5841...,syed ali geelani kashmir is separatist patr...,"(kashmir, external, pakistan)"
2021-09-01 20:42:53,Purdue Pharma Is Dissolved and Sacklers Pay $4...,Jan Hoffman,https://www.nytimes.com/2021/09/01/health/purd...,purdue pharma is dissolved and sacklers p...,"(sacklers, bankruptcy, judge)"
2021-09-01 20:29:18,What a Ben Simmons trade would mean for 76ers:...,,https://www.cbssports.com/nba/news/what-a-ben-...,what a ben simmons trade would mean for a ...,"(sixers, jasmyn, quinn)"
2021-09-01 20:27:48,"It’s Election Season in Germany. No Charisma, ...",Katrin Bennhold,https://www.nytimes.com/2021/09/01/world/europ...,its election season in germany. no charis...,"(scholz, please!, merkel)"
2021-09-01 20:15:58,Vaccinated People and Breakthrough Infections:...,Tara Parker-Pope,https://www.nytimes.com/article/delta-breakthr...,what vaccinated people need to know about...,"(people, experts, vaccinated)"
2021-09-01 20:04:38,Biden: Texas abortion law 'blatantly violates'...,Kelly Hooper,https://www.politico.com/news/2021/09/01/biden...,biden texas abortion law blatantly violates ...,"(texas, magazine, skip)"
2021-09-01 19:39:00,NeNe Leakes’ husband Gregg dies of cancer at 6...,Leah Bitsky,https://pagesix.com/2021/09/01/nene-leakes-hus...,ne ne leakes husband gregg dies of cancer at...,"(leakes, click, husband)"
2021-09-01 19:32:08,Psaki dodges questions about Biden pressing Af...,Jessica Chasmar,https://www.foxnews.com/politics/psaki-dodges-...,psaki dodges questions about biden pressing ...,"(news, psaki, biden)"


# 4. Get Important Words & Phrases

# 5. Search Twitter API
## Using Important Words & Phrases

Create Tweets class

In [14]:
class Tweets():

    def __init__(self, consumer_key, consumer_secret, access_token, access_token_secret, logger=logging):
        self.logger = logging.basicConfig(filename='tweets.log', filemode='w',
                                          format=f'%(asctime)s - %(levelname)s - %(message)s')
        self.consumer_key = consumer_key
        self.consumer_secret = consumer_secret
        self.access_token = access_token
        self.access_token_secret = access_token_secret

        self.location = sys.argv[1]  # user location as argument variable
        # object with latitude & longitude of user location
        self.geo = geocoder.osm(self.location)

    def tweepy_auth(self):
        """Authorize tweepy API"""

        self.auth = tweepy.OAuthHandler(self.consumer_key, self.consumer_secret)
        self.auth.set_access_token(self.access_token, self.access_token_secret)

        # create API object
        self.api = API(self.auth, wait_on_rate_limit=True, user_agent=get_random_ua('Chrome'))# wait_on_rate_limit_notify=True)

        try:
            self.api.verify_credentials()
            logging.info("Tweepy API Authenticated")
            print('Tweepy authentication successful')
        except Exception as e:
            logging.error(f"Error during Tweepy authentication: {e}")
            raise e
        return self.api
    
    def get_tweets(self, news_keywords, news_instance): # TODO add stream listening stuff to params
        searched_tweets = self.tweet_search(news_keywords)
        stream_tweets = TwitterStreamListener.on_status(listener, tweet_stream)

        # all_tweets = {}
        # # process tweets
        # for tweet in searched_tweets:
        #     # count tweets
        #     pass
        #     # add count to df column?
            
        # for tweet in stream_tweets:
        #     pass
        # # break tweets apart for table
        # for tweet in searched_tweets, stream_tweets:
        #     all_tweets["tweet_id"] = tweet['id']

        #     # add all tweets to database! via mogrify

        #     # put tweets in df
        #     self.all_tweets_df = pd.DataFrame.from_dict(all_tweets, columns=[
        #                                               "tweet_id", "user_id", "location", "createdAt", "tweet_text"])

        #     self.all_tweets_df.set_index("tweet_id")

        #     # tweets mention count to news df column
        #     news_instance.all_news_df["tweet_mention_count"] = self.all_tweets_df["tweet_id"].apply(
        #         np.count_nonzero)

            # clear dataframe?
    

    def tweet_search(self, news_keywords):
        """Search for tweets within previous 7 days.
                    Inputs:
                        keyword list
                    Returns:
                        Tweet list => JSON
        """
        api = self.api
        self.search_tweet_count = 0

        # unpack keyword tuples
        print('Searching for tweets matching keywords')
        for keys in news_keywords:
            keywords = list(keys)  # TODO add itertools combinations
            for word in keywords:
                try:
                    result = api.search_tweets(q=str(
                                    word) + " -filter:retweets", lang='en')
                                # print(type(result))
                    status = result[0]
                                # print(type(status))
                    tweet = status._json
                    self.search_tweet_count = len(tweet)
                                #self.file.write(json.dumps(tweets)+ '\\n')
                    tweet = json.dumps(tweet)  # tweet to json string
                    # assert (type(tweet) == str), "Tweet must be converted to JSON string"
                    tweet = json.loads(tweet)  # tweet to dict
                    # assert (type(tweet) == dict), "Tweet must be converted from JSON string to type dict"
                except (TypeError) as e:
                    logging('Error: ', e)
                    print('Error: keyword not found in tweet search')
                    print(self.search_tweet_count)
                    break
                else:
                    # write tweets to json file
                    with open("tweets.json", "a") as f:
                        json.dump(tweet, f)
        logging.info('Tweet search successful')
        print(f'Tweet search by keyword was successful. Counted {self.search_tweet_count} tweets.')

        #finally:
        # TODO add tweet unpacking & cleaning?
        #pass
        # TODO put tweets into db
        # TODO
    
    def clean_tweets(self, tweets):
        # use slang.txt
        # https://www.geeksforgeeks.org/python-efficient-text-data-cleaning/
        pass




In [15]:

# news = News(news_api_key)
t = Tweets(consumer_key, consumer_secret,access_token, access_token_secret)
auth = t.tweepy_auth()


Tweepy authentication successful


In [None]:
# define stream listener class
class TwitterStreamListener(tweepy.Stream):
    def __init__(self, api=None):
        super(tweepy.Stream, self).__init__()
        # super(json.JSONEncoder, self).__init__()
        self.consumer_key = consumer_key
        self.consumer_secret = consumer_secret
        self.access_token = access_token
        self.access_token_secret = access_token_secret
        self.api = api
        
        self.num_tweets = 0
        # self.file = open('tweets.txt', 'w')
        self.tweet_list = []
    
    def toJson(self):
        return json.dumps(self, default=lambda o: o.__dict__)

    def on_status(self, status): # returns JSON 
        #print(status[0]._json)

        #tweet = status._json
        #print(status.id)
        # Retweet count
        # retweet_count = status['retweet_count']
        # status = status.toJson()
        # status = json.loads(status)
        # status = str(status)
        

        while self.num_tweets < 450:  # max stream rate is for the twitter API Client
            # for tweet in status:
            #     tweet = tweet._json
            #     print("printing twetets")
            #     print(tweet)
            try:
               # self.filter()
                with open('tweets.json', 'a') as f:
                    # write results to JSON file
                    print("Returning JSON-ish results as string")
                    json.dump(status._json, f)
                    #f.write(tweet)
                    #return True

            except TypeError as e:
                logging('Error: ', e)
                print('{e}: Please convert Stream object')
                continue
            # TODO: except user exit or stream disconnect
            else:
                with open('tweets.json', 'r') as file:
                    # create Python list object from JSON
                    tweets_json = file.read().split("\n")

                    for tweet in tweets_json:
                        print("Deserializing tweets")
                        tweet_obj = json.loads(tweet)

                        # Tweet ID
                        tweet_obj['tweet_id'] = tweet_obj['id']
                        # User ID
                        tweet_obj['user_id'] = tweet_obj['user']['id']
                        # Username
                        tweet_obj['username'] = tweet_obj['user']['name']
                        # creation date
                        tweet_obj['create_time'] = tweet_obj['user']['creation_date']
                        # Language
                        lang = status['lang']

                        # Tweet
                        if status.truncated == True:
                            tweet = tweet_obj['extended_tweet']['full_text']
                            hashtags = tweet_obj['extended_tweet']['entities']['hashtags']
                        else:
                            tweet = status.text
                            hashtags = status.entities['hashtags']

                        # Read hashtags using helper function
                        print("Reading hashtags")
                        hashtags = read_hashtags(hashtags)

                        # add info to pop_news dict
                        # If tweet is not a retweet and tweet is in English
                        if not hasattr(status, "retweeted_status") and lang == "en":
                            print("Adding tweets to list")
                            self.tweet_list.append(tweet_obj)
                    
                        #self.tweet_list.append(status)
                            self.num_tweets += 1
            finally:
                self.disconnect()
                return self.tweet_list
                

            # flatten data to dataframe
            # tweets = pd.json_normalize(self.tweet_list, record_path=['articles'])
        #self.tweets_df = pd.DataFrame(self.tweet_list, columns=[
                                      #"tweet_id", "publishedAt", "userID", "text", "location"])
    
    def on_data(self, data):
        # payload = {}
        data = json.loads(data)
        print(data)

    # Extract hashtags
    def read_hashtags(self, tag_list):
        hashtags = []

        for tag in tag_list:
            hashtags.append(tag['text'])

        return hashtags
    
    def clean_tweets(self):

        with open("tweets.json", "w") as f:
            # write tweets to json file
            json.dump(tweet, f)

        with open("tweets.json", "r") as file:
            # create python object from json
            tweets_json = file.read().split("\n")

            for tweet in tweets_json:
                tweet_obj = json.loads(tweet)

                #flatten nested fields
                if 'quoted_status' in tweet_obj:
                    tweet_obj['quote_tweet'] = tweet_obj['quoted_status']['extended_tweet']['full_text']
                if 'user' in tweet_obj:
                    tweet_obj['location'] = tweet_obj['user']['location']
                # if 'created_at' in tweet_obj:
                #     tweet_obj['created_at'] = pd.to_datetime(tweet)

    def on_error(self, status_code):
        if status_code == 420:
            return False  # false disconnects the stream


In [None]:

auth = t.tweepy_auth()
# instantiate Tweet Stream Listener
listener = TwitterStreamListener()
# authenticate stream
tweet_stream = tweepy.Stream(auth, listener, access_token, access_token_secret) #tweet_mode="extended")
#listener.on_status(tweet_stream)
listener.on_data(tweet_stream)

# print cleaned tweets df

# print(news.all_news_df)


# 6. Search TikTok
## Using Important Words & Phrases

In [43]:
# nest_asyncio.apply()
# __import__('IPython').embed()

c = configparser.ConfigParser()
c.read('config.ini')

host = c['database']['host']
username = c['database']['user']
password = c['database']['password']
db = c['database']['database']

news_api_key = c['newsAuth']['api_key']
tiktok_sv_id = c['tiktokAuth']['s_v_web_id']
tiktok_tt_id = c['tiktokAuth']['tt_webid']

tiktok_auth = {
    "s_v_web_id": tiktok_sv_id,  # references variables saved from config file
    "tt_webid": tiktok_tt_id
}

# sys.setrecursionlimit(10000)

class TikTokAPI(TikTokAPI):

    # create db connection for mogrify
    # DATABASE = DataBase(host, username, password)
    # CXN = DATABASE.create_db_connection(db)

    def __init__(self, cookie, logger=logging, api=None):
        super(TikTokAPI, self).__init__(cookie) # must increase recurison limit
        self.cookie = cookie
        self.tiktok_count = 0
        self.logger = logging.basicConfig(filename='tiktok.log', filemode='w',
                                          format=f'%(asctime)s - %(levelname)s - %(message)s')
        #self.tiktok_df = 

    def getVideosByHashtag(self, hashtags, count=3000):
        try:
            for hashTag in hashtags:
                try:
                    hashTag = hashTag.replace("#", "")
                    hashTag_obj = self.getHashTag(hashTag)
                    hashTag_id = hashTag_obj["challengeInfo"]["challenge"]["id"]
                except (KeyError, ReferenceError) as err:
                    logging.error(err)
                    continue
                else:
                    url = self.base_url + "/challenge/item_list/"
                    req_default_params = {
                        "secUid": "",
                        "type": "3",
                            "minCursor": "0",
                            "maxCursor": "0",
                            "shareUid": "",
                            "recType": ""
                        }
                    params = {
                        "challengeID": str(hashTag_id),
                            "count": str(count),
                            "cursor": "0",
                        }
                    for key, val in req_default_params.items():
                            params[key] = val
                    for key, val in self.default_params.items():
                            params[key] = val
                    extra_headers = {
                            "Referer": "https://www.tiktok.com/tag/" + str(hashTag)}
                    self.tiktok_count += 1
                    toks = self.send_get_request(url, params, extra_headers=extra_headers)
                    print(self.tiktok_count)
                    self.tiktok_df = pd.DataFrame(toks, columns=[
                        'postID', 'createTime', 'userID', 'description', 'musicId', 'soundId', 'tags'])
                    return toks
        except KeyboardInterrupt as ex:
            raise ex
        finally:
            logging.info(f'This run scraped {self.tiktok_count} TikToks')
            return self.tiktok_df
            # DATABASE.execute_mogrify(CXN, tiktok_df, )
            # return tiktok_df
            # TODO mogrify into database


In [44]:

api = TikTokAPI(cookie=tiktok_auth)
#api.tiktok_df =
tiktoks = news.all_news_df['keywords'].map(api.getVideosByHashtag)


AttributeError: 'TikTokAPI' object has no attribute 'cookie'

In [None]:
print(tiktok_df)

# 7. Add Late-Arriving Dimensions/Data
### *data corresponding to 3 days before news hit

In [None]:
# tweet search instead of stream
keywords = news.get_all_news()
tw = t.tweet_search(keywords)

# 7. Tally Up
### Partition total mentions by day

Add to database

In [None]:
# mogrify stream
# postgres_db.execute_mogrify(connection, filtered_stream, 'stream_tweets')
# mogrify batch tweets
# postgres_db.execute_mogrify(connection, tw, 'batch_tweets')
# # execute mogrify - insert news df into database
# postgres_db.execute_mogrify(connection, news.all_news_df, 'articles')
# mogrify 
postgres_db.execute_mogrify(connection, api.tiktok_df, 'tiktoks')

# TODO group by event? 

# 9. Plot & Analyze
- On which platform (Twitter or TikTok) do folks engage with politics the most?
- Where in the US is engagement the highest?
- Which political events seem to cause the most reaction among youth?