### Enriching a tweet dataset with the author's name, location, and description, and the handles of the users who retweeted, liked, or commented

**path_dataSet_A** : previously created dataset consisting of 1K tweet_id and tweet pairs of 500 users

**path_rtLikeCommentData** : previously created dataset consisting of lists of the handles of users who liked, commented on, and retweeted each tweet

**path_enriched_dataSet_A**: path to save the final dataset.


# 


In the following cells, 3 sample outputs from the dataset are given. In the first record (i),

> **tweet text is** "A new year for artificial human intelligence  https://t.co/mXPCnsFnC6 #ArtificialIntelligence #ai #datascience #MachineLearning #automation #SelfDrivingCars #automation https://t.co/z6x08EeoEc"

> **user (author) handle is** "JeffreyBuskey"

> **location is** "Chicago"

> **description text is**  "Building productive Sales teams one Company at a time while increasing sales growth using Social Media and Marketing | Leadership| Sales Coach | CyberSecurity"

> **handles of the users who retweeted are** "RobotConsumer"

> and **there are no likes or comments**

The script in this notebook combines these pieces of information (tweet, user, and user activity data) into a single record by adding dummy tokens between them. The tweet and the Rt&Like&Comment data were collected beforehand. Name&Location&Description are collected on the fly with the Twitter API.
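
The layout of the author part of a record can be sketched as follows. This is a simplified illustration, not the notebook's implementation: the helper name `append_author_fields` is hypothetical, and each dummy token is written once, whereas the notebook's `strValueUpdate` repeats each token five times. The token spellings are the `*&...*&` forms used in the code cells below.

```python
# Dummy tokens as spelled in the notebook's code cells.
ENR, NAME, LOC, DESC = "*&enr*&", "*&name*&", "*&loc*&", "*&desc*&"


def append_author_fields(tweet, name, location, description):
    """Hypothetical helper: tweet text, then the enrichment marker,
    then each author field preceded by its own dummy token."""
    parts = [tweet, ENR, NAME, name, LOC, location, DESC, description]
    return " ".join(parts)


record = append_author_fields(
    "A new year for artificial human intelligence",
    "JeffreyBuskey",
    "Chicago",
    "Leadership| Sales Coach | CyberSecurity",
)
print(record)
```

The like, rt, and comment lists are appended after the description in the same token-then-value pattern.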

i)

ii)

iii)

#

**Notes:**

**1) In our first attempt, we used the following strings as dummy tokens:**

> \*&enr\*&, \*&name\*&, \*&loc\*&, \*&like\*&, \*&rt\*&, \*&comment\*&, \*&desc\*&

However, we found that the text cleaning process treated them like stop words and discarded them, because they contain non-alphanumeric characters. We therefore replaced them with the following tokens, which the paper refers to as "enrtag", "nametag", etc.

> _2764enrtag0918, 2764name0918, 2764loc0918, 2764like0918, 2764rt0918, 2764comment0918_

This ensured that they were treated as valid tokens. The digits added around each word (the arbitrary combination "2764" and "0918") make the tokens unlikely to appear in the original tweet text or user info.

> _rpl -R "\*&enr\*&" "2764enrtag0918" #shell command for replacing words_
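
The same replacement can be sketched in Python. This is an illustration under assumptions: the pairing of old and new tokens other than `*&enr*&` is inferred from their names, the `desc` token is left out because its replacement is not listed above, and `keep_alnum_tokens` is only a stand-in for a cleaning step that drops tokens containing non-alphanumeric characters (the actual cleaning code is not part of this notebook).

```python
# Old token -> new token. Only the first pair appears in the rpl command;
# the others are paired by name (an assumption).
TOKEN_MAP = {
    "*&enr*&": "2764enrtag0918",
    "*&name*&": "2764name0918",
    "*&loc*&": "2764loc0918",
    "*&like*&": "2764like0918",
    "*&rt*&": "2764rt0918",
    "*&comment*&": "2764comment0918",
}


def replace_tokens(text, token_map=TOKEN_MAP):
    """Replace every old dummy token in `text` with its new spelling."""
    for old, new in token_map.items():
        text = text.replace(old, new)
    return text


def keep_alnum_tokens(text):
    """Assumed cleaning step: keep only purely alphanumeric tokens."""
    return " ".join(tok for tok in text.split() if tok.isalnum())


raw = "hello *&enr*& *&name*& bob"
print(keep_alnum_tokens(raw))                  # old tokens are dropped
print(keep_alnum_tokens(replace_tokens(raw)))  # new tokens survive
```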


**2) (hiç like yok) , (hiç rt yok), and (hiç comment yok)**

 These are used as additional dummy tokens when one of the user lists (like, rt, comment) is empty. The placeholder text is added between the related dummy tokens:
 <br> when no user liked the tweet, "(hiç like yok)" is added
 <br> when no user retweeted the tweet, "(hiç rt yok)" is added
 <br> when no user commented on the tweet, "(hiç comment yok)" is added
 
 "(hiç like yok)" means "there are no likes" in Turkish.
 <br> "(hiç rt yok)" means "there are no retweets", and "(hiç comment yok)" means "there are no comments".
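
The placeholder rule can be sketched as below. This is a minimal illustration: `activity_block` and `PLACEHOLDERS` are hypothetical names, and each dummy token is written once rather than five times as in the notebook's `newCsvLine`.

```python
# Placeholder strings taken from the notebook's code.
PLACEHOLDERS = {
    "like": "(hiç like yok)",
    "rt": "(hiç rt yok)",
    "comment": "(hiç comment yok)",
}


def activity_block(kind, handles):
    """Hypothetical helper: build the like/rt/comment part of a record.

    Each handle is preceded by the dummy token for `kind`; an empty list
    is replaced by the placeholder; a final token closes the block.
    """
    token = "*&{}*&".format(kind)
    values = handles if handles else [PLACEHOLDERS[kind]]
    parts = []
    for value in values:
        parts.extend([token, value])
    parts.append(token)  # closing token
    return " ".join(parts)


print(activity_block("rt", ["RobotConsumer"]))
print(activity_block("like", []))
```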

In [None]:
import os
import csv
import tweepy
import re


from tweepy import OAuthHandler


# Fill in the values with your own Twitter API credentials.
TWITTER_APP_KEY = ""
TWITTER_APP_SECRET = ""
TWITTER_KEY = ""
TWITTER_SECRET = ""

auth = OAuthHandler(TWITTER_APP_KEY, TWITTER_APP_SECRET)
auth.set_access_token(TWITTER_KEY, TWITTER_SECRET)

api = tweepy.API(auth, wait_on_rate_limit=True)


def createDir(dirPath):
    if not os.path.exists(dirPath):
        os.mkdir(dirPath)
        
        
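def userNameLocScrn(user_name):
    # Look up the author's profile and keep location, screen name and description.
    # The keyword form is used because Tweepy v4 no longer accepts positional
    # arguments here; this assumes user_name is a handle (screen name).
    user = api.get_user(screen_name=user_name)
    return {"loc": user.location, "scrn": user.screen_name, "desc": user.description}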


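def strValueUpdate(sentence, key, lookUp):
    # Append the dummy token `key` five times, followed by the value `lookUp`.
    # For the opening " *&enr*& " marker and the closing "keyBitis" sentinel
    # ("key end" in Turkish), only the five tokens are appended, with no value.
    sentence += key * 5
    if lookUp != " *&enr*& " and lookUp != "keyBitis":
        sentence += " {} ".format(lookUp)
    return sentence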


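def newCsvLine(oldList,newList,key_indis):
    # Append one activity block to the record (key_indis: 0 = like, 1 = rt, 2 = comment).
    keyArr = [" *&like*& ", " *&rt*& ", " *&comment*& "]
    lkRtCmmntNotFoundStr = ["(hiç like yok)", "(hiç rt yok)", "(hiç comment yok)"]
    if len(newList)==0:
        # Empty list: add the "there are no ..." placeholder instead of handles.
        oldList = strValueUpdate(oldList, keyArr[key_indis],lkRtCmmntNotFoundStr[key_indis])
    else:
        # One token run per handle in the comma-separated list.
        for list_val in newList.split(','):
            oldList = strValueUpdate(oldList, keyArr[key_indis],list_val)
    # "keyBitis" closes the block with a final token run and no value.
    oldList = strValueUpdate(oldList, keyArr[key_indis], "keyBitis")
    return oldList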


def createEnrichedDataset(mainDir,newSaveDir,lRCTwtHpDir):
    for dirName in os.listdir(mainDir):
        newDirPath = os.path.join(newSaveDir,dirName)
        createDir(newDirPath)
        for subDirName in os.listdir(os.path.join(mainDir,dirName)):
            newSubDirPath = os.path.join(newDirPath, subDirName)
            createDir(newSubDirPath)
            for subDirNameFileName in os.listdir(os.path.join(mainDir,dirName,subDirName)):
                textFile = os.path.join(mainDir,dirName,subDirName,subDirNameFileName)
                lrcFile = os.path.join(lRCTwtHpDir,subDirName, subDirNameFileName)
                newSaveCsvFile =  os.path.join(newDirPath, subDirName,subDirNameFileName)
                # Note: the "mbcs" codec exists only on Windows; use the files' actual encoding on other systems.
                with open(textFile, 'r',encoding="mbcs") as csvTextFile,open(lrcFile, 'r',encoding="mbcs") as csvLRCFile:
                    readerText = csv.reader(csvTextFile)
                    # Read all activity rows up front so that every tweet is matched against the whole file.
                    readerLRC = list(csv.reader(csvLRCFile))
                    for rowText in readerText:
                        newliste = []
                        newliste.append(rowText[0])
                        newliste.append(rowText[1])
                        newliste.append(rowText[2])
                        oldTwt = rowText[3]
                        for rowLRC in readerLRC:
                            if rowLRC[0].split('{!!}')[0]==rowText[0]:
                                locScrnDesc = userNameLocScrn(rowText[2])
                                newTwt = strValueUpdate(oldTwt," *&enr*& "," *&enr*& ")

                                newTwt = strValueUpdate(newTwt, " *&name*& ",locScrnDesc['scrn'])

                                newTwt = strValueUpdate(newTwt, " *&loc*& ",locScrnDesc['loc'])

                                newTwt = strValueUpdate(newTwt, " *&desc*& ",locScrnDesc['desc'])

                                listToStr = ', '.join(rowLRC)
                                LkRtCmtIndex=[m.start() for m in re.finditer('{!!}',listToStr)]
                                likeList = listToStr[LkRtCmtIndex[0]+4:LkRtCmtIndex[1]]
                                rtList = listToStr[LkRtCmtIndex[1] + 4:LkRtCmtIndex[2]]
                                comtList = listToStr[LkRtCmtIndex[2] + 4:]
                                newTwt = newCsvLine(newTwt, likeList, 0)
                                newTwt = newCsvLine(newTwt, rtList, 1)
                                newTwt = newCsvLine(newTwt, comtList, 2)
                                newliste.append(newTwt)
                                with open(newSaveCsvFile,'a',encoding='utf-8') as wF:
                                    writer = csv.writer(wF)
                                    writer.writerow(newliste)

                                break

In [None]:
input_path = os.path.join(os.getcwd(), "path_dataSet_A")
output_path = os.path.join(os.getcwd(), "path_enriched_dataSet_A")
rtLikeComment_path = os.path.join(os.getcwd(), "path_rtLikeCommentData")

createEnrichedDataset(input_path, output_path, rtLikeComment_path)
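
The matching step in `createEnrichedDataset` pairs each tweet with its activity row by comparing tweet ids. One alternative, sketched below, is to parse every activity row once and index it by tweet id. The row layout (`tweet_id{!!}likes{!!}rts{!!}comments`, with handles separated by commas) is inferred from the slicing code above, and the names `parse_lrc_row` and `index_lrc_rows` are hypothetical. Unlike the notebook's code, this sketch strips whitespace around handles and drops empty entries.

```python
def parse_lrc_row(row):
    """Split one CSV row of the activity file into its four parts.

    Assumes the joined row contains at least three '{!!}' separators.
    """
    joined = ", ".join(row)
    tweet_id, likes, rts, comments = joined.split("{!!}", 3)

    def handles(chunk):
        return [h.strip() for h in chunk.split(",") if h.strip()]

    return {
        "tweet_id": tweet_id.strip(),
        "likes": handles(likes),
        "rts": handles(rts),
        "comments": handles(comments),
    }


def index_lrc_rows(rows):
    """Map tweet id -> parsed activity lists for constant-time lookup."""
    return {parsed["tweet_id"]: parsed for parsed in map(parse_lrc_row, rows)}


# A csv.reader splits on commas, so one logical record arrives as several cells.
rows = [["123{!!}alice", " bob{!!}RobotConsumer{!!}"]]
lookup = index_lrc_rows(rows)
print(lookup["123"])
```

With such a lookup, a tweet whose id is missing from the activity file can be detected with `tweet_id in lookup` and handled explicitly.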