# Using **pyTwitterAnalysis** 

#### *This is a sample of some of the basic functionality available in the pyTwitterAnalysis package*

***

 ##### **What this notebook will do**:
 + load raw Twitter JSON files into a MongoDB database
 + create a MongoDB collection to analyze the data
 + create edges files based on user connections (mentions, quotes, replies, retweets)
 + create folder structure to save all files (by period or not)
 + create the following files for each folder and sub folder
     + nodes with degrees 
     + edges
     + texts for topics
     + graph with lda model
     + graph plot
     + graph plot with contracted nodes
     + hashtag & word frequency lists
     + hashtag & word bar charts
     + timeseries plot (tweet count & hashtag count)
     + wordclouds (high degree nodes, high frequency hashtags, high frequency words)               
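
To make the idea of "edges based on user connections" concrete, here is a minimal sketch of how the four connection types could be read from a single tweet. It assumes the Twitter API v1.1 tweet layout (`user.screen_name`, `entities.user_mentions`, `in_reply_to_screen_name`, `retweeted_status`, `quoted_status`). The `extract_edges` function is hypothetical and is not the package's own implementation.

```python
def extract_edges(tweet):
    """Return (source, target, kind) tuples for one tweet dict (Twitter API v1.1 layout assumed)."""
    source = tweet["user"]["screen_name"]
    edges = []

    # Mentions: every account named in the tweet's entities
    for mention in tweet.get("entities", {}).get("user_mentions", []):
        edges.append((source, mention["screen_name"], "mention"))

    # Reply: the account this tweet answers, if any
    reply_to = tweet.get("in_reply_to_screen_name")
    if reply_to:
        edges.append((source, reply_to, "reply"))

    # Retweet and quote: the author of the embedded original tweet
    for field, kind in (("retweeted_status", "retweet"), ("quoted_status", "quote")):
        original = tweet.get(field)
        if original:
            edges.append((source, original["user"]["screen_name"], kind))

    return edges


# Made-up tweet used only for illustration
sample_tweet = {
    "user": {"screen_name": "alice"},
    "entities": {"user_mentions": [{"screen_name": "bob"}]},
    "in_reply_to_screen_name": "bob",
    "quoted_status": {"user": {"screen_name": "carol"}},
}
print(extract_edges(sample_tweet))
```

Each tuple is one directed edge from the tweet's author to another user, labeled with the connection type.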

#### Initialize packages

In [None]:
import pyTwitterAnalysis as ta
from pymongo import MongoClient

#### Set your mongoDB connection

In [None]:
#db connection
mongoDBConnectionSTR = "mongodb://localhost:27017"
client = MongoClient(mongoDBConnectionSTR)
db = client.twitter_DB_MeToo_Tst3  # choose your DB name here

#### Set the folder path where you want to save all of the output files

In [None]:
BASE_PATH = 'D:\\Data\\myTwitterAnalysisFiles'
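
This notebook does not say whether the package creates the output folder when it is missing, so creating it up front is a cautious option. The sketch below uses a hypothetical helper, `ensure_folder`, and demonstrates it on a temporary directory rather than on `BASE_PATH`.

```python
import tempfile
from pathlib import Path


def ensure_folder(path):
    """Create the folder (and any missing parents) and return it as a Path."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return folder


# Demonstrated on a temporary directory; in this notebook you would pass BASE_PATH.
with tempfile.TemporaryDirectory() as tmp:
    out = ensure_folder(Path(tmp) / "myTwitterAnalysisFiles")
    print(out.is_dir())
```

In the notebook itself the call would simply be `ensure_folder(BASE_PATH)`.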

#### Initialize your twitterAnalysis object

In [None]:
x = ta.tw_analysis(BASE_PATH, db)

#### Import data from JSON files into a MongoDB database

In [None]:
JSON_FILES_PATH = 'D:\\Data\\my_json_files'  # folder that contains all of your Twitter JSON files
x.loadDocFromFile(JSON_FILES_PATH)
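
Before importing, it can help to confirm that the files actually parse. The sketch below assumes one JSON object per line and a `.json` file extension, a common layout for collected tweets that this notebook does not confirm. `count_json_lines` and `summarize_folder` are hypothetical helpers, not part of pyTwitterAnalysis.

```python
import json
import tempfile
from pathlib import Path


def count_json_lines(file_path):
    """Return (valid, invalid) counts of non-empty lines that parse / fail to parse as JSON."""
    valid = invalid = 0
    with open(file_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                invalid += 1
    return valid, invalid


def summarize_folder(folder):
    """Map each *.json file name in the folder to its (valid, invalid) line counts."""
    return {p.name: count_json_lines(p) for p in sorted(Path(folder).glob("*.json"))}


# Demonstrated on a temporary folder; in this notebook you would pass JSON_FILES_PATH.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json"
    sample.write_text('{"id": 1}\n\n{"id": 2}\nnot json\n', encoding="utf-8")
    print(summarize_folder(tmp))
```

A file with a high invalid count may be truncated or stored in a different layout, which is worth checking before a long import.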

#### Create database collection that will be used to analyse the data

In [None]:
step = 100000   # Number of tweets to load at a time. (A large number may cause out-of-memory errors; a small number may take a long time to run.)
x.build_db_collections(step)

#### Set up the periods you want to analyse 
#####  ** Set period_arr to None if you don't want to analyze separate periods

In [None]:
#Format: Period Name, Period Start Date, Period End Date
period_arr = [['P1',  '10/08/2017 00:00:00', '10/15/2017 00:00:00'],
              ['P2',  '10/15/2017 00:00:00', '10/29/2017 00:00:00'],
              ['P3',  '10/29/2017 00:00:00', '11/12/2017 00:00:00'],
              ['P4',  '11/12/2017 00:00:00', '11/26/2017 00:00:00'],
              ['P5',  '11/26/2017 00:00:00', '12/10/2017 00:00:00'],
              ['P6',  '12/10/2017 00:00:00', '12/24/2017 00:00:00'],              
              ['P7',  '12/24/2017 00:00:00', '01/07/2018 00:00:00'],              
              ['P8',  '01/07/2018 00:00:00', '01/21/2018 00:00:00'],
              ['P9',  '01/21/2018 00:00:00', '02/04/2018 00:00:00'],              
              ['P10', '02/04/2018 00:00:00', '02/18/2018 00:00:00'],
              ['P11', '02/18/2018 00:00:00', '03/04/2018 00:00:00'],
              ['P12', '03/04/2018 00:00:00', '03/19/2018 00:00:00']]
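
A typo in one of these dates is easy to miss. The sketch below checks a period array before it is used, assuming the dates follow `MM/DD/YYYY HH:MM:SS` as the sample values suggest. `check_periods` is a hypothetical helper. A gap or overlap between periods may be intentional, so it is only reported, not treated as an error.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumed from the sample values above


def check_periods(periods):
    """Return a list of human-readable problems found in a period array."""
    problems = []
    previous_end = None
    for name, start_str, end_str in periods:
        start = datetime.strptime(start_str, DATE_FORMAT)
        end = datetime.strptime(end_str, DATE_FORMAT)
        if start >= end:
            problems.append(f"{name}: start is not before end")
        if previous_end is not None and start != previous_end:
            problems.append(f"{name}: does not begin where the previous period ends")
        previous_end = end
    return problems


sample_periods = [["P1", "10/08/2017 00:00:00", "10/15/2017 00:00:00"],
                  ["P2", "10/15/2017 00:00:00", "10/29/2017 00:00:00"]]
print(check_periods(sample_periods))  # prints []
```

An empty list means every period starts before it ends and each one begins exactly where the previous one stops.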

#### Export edges
##### This is an important step. We will use the files created here in other steps 

In [None]:
x.export_all_edges_for_input(period_arr=period_arr)

#### Print initial EDA

In [None]:
x.eda_analysis()

### IMPORTANT STEP: Choose your settings here before running the next step
##### These variables will help you decide what files you want to see and with which parameters 
##### Running the next step could take a long time. If you want to run it piece by piece so you can see results sooner, you can set the flags to 'Y' one at a time

In [None]:
OUTPUT_PATH = BASE_PATH
IS_BOT_FILTER = None
PERIOD_ARR = period_arr 

#Choose which files you want to print - #Options: (Y/N)
CREATE_NODES_EDGES_FILES_FLAG = 'Y'   
CREATE_GRAPHS_FILES_FLAG = 'Y'
CREATE_TOPIC_MODEL_FILES_FLAG = 'Y'
CREATE_HT_FREQUENCY_FILES_FLAG = 'Y'
CREATE_WORDS_FREQUENCY_FILES_FLAG = 'Y'
CREATE_TIMESERIES_FILES_FLAG = 'Y'

#topic settings
NUM_OF_TOPICS = 6           #This is the number of topics to send as input to LDA model  (Default is 4)
TOP_NO_WORD_FILTER = None   #The number of words to save in the word frequency list file. (Default=5000)


#graph settings
COMTY_CONTRACT_PER = 90   #This is a percentage used to remove nodes so that large graphs can be plotted. 
                          #You can run this logic multiple times with different percentages. Each run will save the graph file with a different name
GRAPH_PLOT_CUTOFF_NO_EDGES = 3000 #This is the edge-count cutoff used to decide whether the graph will be plotted. 
                                  #The logic will remove nodes until the graph gets down to this number of edges
                                  #A large number may take a long time to run. A small number may contract nodes too much or not plot the graph at all


#We will create subfolders for the top degree nodes based on these settings
TOP_DEGREE_START=1   
TOP_DEGREE_END=10

#We will create subfolders for the top degree nodes for each period based on these settings
PERIOD_TOP_DEGREE_START=1
PERIOD_TOP_DEGREE_END=5

##### This step will create all the analysis files
##### You can run this multiple times using different parameters

In [None]:
x.edge_files_analysis(output_path=OUTPUT_PATH,
                      is_bot_Filter=IS_BOT_FILTER,
                      period_arr=PERIOD_ARR,
                      create_nodes_edges_files_flag=CREATE_NODES_EDGES_FILES_FLAG, 
                      create_graphs_files_flag=CREATE_GRAPHS_FILES_FLAG,
                      create_topic_model_files_flag=CREATE_TOPIC_MODEL_FILES_FLAG,
                      create_ht_frequency_files_flag=CREATE_HT_FREQUENCY_FILES_FLAG,
                      create_words_frequency_files_flag=CREATE_WORDS_FREQUENCY_FILES_FLAG,
                      create_timeseries_files_flag=CREATE_TIMESERIES_FILES_FLAG,
                      num_of_topics=NUM_OF_TOPICS, 
                      top_no_word_filter=TOP_NO_WORD_FILTER, 
                      graph_plot_cutoff_no_edges=GRAPH_PLOT_CUTOFF_NO_EDGES, 
                      comty_contract_per=COMTY_CONTRACT_PER,
                      top_degree_start=TOP_DEGREE_START,
                      top_degree_end=TOP_DEGREE_END, 
                      period_top_degree_start=PERIOD_TOP_DEGREE_START, 
                      period_top_degree_end=PERIOD_TOP_DEGREE_END
                     )
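
Since the full run can be slow, one way to run it piece by piece is to build the six flag arguments with only one of them set to 'Y'. The sketch below does that with a hypothetical helper named `only`. The keyword names are the ones used in the call above; nothing else about the package is assumed.

```python
FLAG_NAMES = [
    "create_nodes_edges_files_flag",
    "create_graphs_files_flag",
    "create_topic_model_files_flag",
    "create_ht_frequency_files_flag",
    "create_words_frequency_files_flag",
    "create_timeseries_files_flag",
]


def only(flag_name):
    """Return keyword arguments with one flag set to 'Y' and all others to 'N'."""
    if flag_name not in FLAG_NAMES:
        raise ValueError(f"unknown flag: {flag_name}")
    return {name: ("Y" if name == flag_name else "N") for name in FLAG_NAMES}


print(only("create_timeseries_files_flag"))
```

The result can replace the six flag arguments in the call above, for example by ending the argument list with `**only("create_timeseries_files_flag")` and keeping the other arguments unchanged. Whether some outputs depend on files produced by an earlier flag is not stated here, so running the nodes and edges step first is a cautious order.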