# Notebook Experimental

This notebook can be used to write and test code outside the process defined by the other notebooks, without directly affecting that process. Typical use cases are listed below.

In [1]:
%load_ext autoreload
%autoreload 2

---

#### Displaying the intermediate dataset

In [2]:
import pandas as pd

df = pd.read_feather('../data/intermediate/twitter_tweets_intermediate.feather')
df.head(4)

Unnamed: 0,url,date,rawContent
0,https://twitter.com/_Bob_S/status/164159110508...,2023-03-30 23:59:00,Govt IT security isnt 'a nice thing to do': it...
1,https://twitter.com/WYSIWYGVentures/status/164...,2023-03-30 23:59:00,ISC West 2023: Cyberattackers Are Targeting Ph...
2,https://twitter.com/HackerAran7/status/1641591...,2023-03-30 23:59:00,What’s the hack. #stem #science #stemeducation...
3,https://twitter.com/bytefeedai/status/16415909...,2023-03-30 23:59:00,BuzzFeed Is Using AI To Write SEO-Bait Travel ...


#### Generating list_parameter_combinations.pkl

This file must be created so that the hyperparameter tuning can run headless.

In [1]:
# %%script false
from src import utils
import itertools

# Candidate values for each hyperparameter
num_topics = list(range(13, 24))
alpha = ['symmetric', 'asymmetric', 0.1, 0.2, 0.3]
eta = [0.5, 0.6, 0.7, 0.9]
passes = list(range(10, 13))

# Cartesian product of all candidate values, stored as one dict per combination
keys = ('num_topics', 'alpha', 'eta', 'passes')
combinations = itertools.product(num_topics, alpha, eta, passes)
combinations_dict = [dict(zip(keys, item)) for item in combinations]
utils.safe_as_pkl(combinations_dict, '../data/modeling/list_parameter_combinations.pkl')
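
As a quick sanity check on the size of this grid, the following minimal sketch rebuilds the same combinations using only the standard library and counts them. It does not call `utils.safe_as_pkl`; the variable names `keys` and `grid` are chosen here for illustration.

```python
import itertools

# Same candidate values as in the cell above.
num_topics = list(range(13, 24))                     # 11 values: 13..23
alpha = ['symmetric', 'asymmetric', 0.1, 0.2, 0.3]   # 5 values
eta = [0.5, 0.6, 0.7, 0.9]                           # 4 values
passes = list(range(10, 13))                         # 3 values: 10..12

# One dict per combination, in the order itertools.product yields them.
keys = ('num_topics', 'alpha', 'eta', 'passes')
grid = [dict(zip(keys, values))
        for values in itertools.product(num_topics, alpha, eta, passes)]

print(len(grid))  # 11 * 5 * 4 * 3 = 660
print(grid[0])    # {'num_topics': 13, 'alpha': 'symmetric', 'eta': 0.5, 'passes': 10}
```

A full search therefore covers 660 parameter combinations, which is useful to know when estimating the runtime of a headless run.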

#### Restoring the hyperparameter tuning results

The results of the hyperparameter tuning can be restored from the cache file.

In [6]:
import pandas as pd
from src.utils import load_pkl

results = load_pkl('../data/modeling/cache_hyperparameter_tuning_2.pkl')
df = pd.DataFrame(results).astype(str)
df.to_feather('../data/modeling/ht_results_randomsearch.feather')
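
Because of `astype(str)`, every column in the saved file holds strings, so numeric columns need to be converted back before sorting or plotting. The sketch below shows one way to do that with `pd.to_numeric`. The rows and the `coherence` column are made up for illustration and are not taken from the actual cache file; the real results may have different columns.

```python
import pandas as pd

# Hypothetical tuning results; all values and the 'coherence' column are invented.
results = [
    {'num_topics': 13, 'alpha': 'symmetric', 'eta': 0.5, 'passes': 10, 'coherence': 0.41},
    {'num_topics': 14, 'alpha': 0.1, 'eta': 0.6, 'passes': 11, 'coherence': 0.44},
]

# Same conversion as in the cell above: every column becomes a string column.
df = pd.DataFrame(results).astype(str)

# Restore numeric dtypes for the purely numeric columns.
# 'alpha' stays a string because it mixes names and numbers.
for col in ('num_topics', 'eta', 'passes', 'coherence'):
    df[col] = pd.to_numeric(df[col])

# Pick the row with the highest (hypothetical) score.
best = df.sort_values('coherence', ascending=False).iloc[0]
print(best['num_topics'], best['alpha'])
```

Keeping `alpha` as a string avoids mixing text and floats in one column, at the cost of having to parse values such as `'0.1'` again before passing them to a model.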

#### Calculating how many tweets contain foreign languages

In [2]:
from src.data.nitter_scraper_standalone_v2 import Tweet, TweetScraper
import pandas as pd

list_of_tweets = TweetScraper.load_collected_tweets(path='../data/raw/twitter_tweets_raw.pkl')
dict_of_tweets = [{"url": tweet.url, "date": tweet.date, "rawContent": tweet.rawContent} for tweet in list_of_tweets]
df1 = pd.DataFrame(dict_of_tweets)
df2 = pd.read_feather('../data/intermediate/twitter_tweets_intermediate.feather')

x = len(df1) - len(df2) - 140772
x

2312

In [3]:
len(df2)

851926
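
The two outputs above are enough to reconstruct the raw tweet count and the share of foreign-language tweets. The following sketch only redoes that arithmetic. It assumes that `x` really counts foreign-language tweets, as the heading suggests; the notebook does not state what the constant 140772 stands for, so the name `other_removed` is a guess.

```python
# Figures taken from the cell code and outputs above.
foreign_language_tweets = 2312   # value of x
intermediate_tweets = 851926     # len(df2)
other_removed = 140772           # constant from the cell; its meaning is an assumption

# Raw tweet count implied by x = len(df1) - len(df2) - 140772.
raw_tweets = foreign_language_tweets + intermediate_tweets + other_removed

share = foreign_language_tweets / raw_tweets
print(raw_tweets)        # 995010
print(f"{share:.2%}")    # 0.23%
```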

#### Testing twitter_scraper_standalone

In [6]:
from src.data.twitter_scraper_standalone import transform_to_dataframe
import pandas as pd

# Show full tweet text instead of truncating long columns
pd.set_option('display.max_colwidth', None)
df = transform_to_dataframe('../data/raw/data_1687865107.pkl')
df.head(100)

Unnamed: 0,url,date,rawContent,lang,replyCount,retweetCount,likeCount
0,https://twitter.com/AgileScrumGuide/status/980595436082769920,2018-04-01 23:59:05+00:00,[ video ]\nAgile Has a Long and Colorful Heritage \nvimeo.com/259429846 \n\n#agile #scrum #business #IT #tech #technology #startup #innovators #visionary #visionaries #entrepreneur #history #agilehistory #heritage #agileheritage #timeline #success #minisode #vimeo #video https://t.co/CfME8acq5T,en,0,2,8
1,https://twitter.com/CioAmaro/status/980595423055314946,2018-04-01 23:59:02+00:00,Evidently there are no enough cyber security staff. Why?\n#Infosec #CyberSecurity #CyberAttack #Hack #Breach #Threat #DDoS #CyberWarfare #Malware #Ransomware #Cyberwarning #Phishing #SpyWare\n#Tech #Technology #Health #HealthTech https://t.co/xneGxntbwT,en,1,3,4
2,https://twitter.com/MelissaOnline/status/980595389131821056,2018-04-01 23:58:54+00:00,#Women in NYC #Tech: Sybil Steele of Temme Media alleywat.ch/2J9jQ7U via @alleywatch https://t.co/ENMnKi1yJk,en,0,0,3
3,https://twitter.com/DigitalKeith/status/980595211163328512,2018-04-01 23:58:11+00:00,A great #Startup bit.ly/2rO25WL #Innovation #GrowthHacking #bigdata #Disruption #makeyourownlane #defstar5 #IoT #Mpgvip by #homejobsbass https://t.co/Usylj2L2Ln,en,0,1,2
4,https://twitter.com/Ananna16/status/980595108759269376,2018-04-01 23:57:47+00:00,#Discount | 26 Best #MachineLearning and #DeepLearning Courses for #DataScientists \n\ngoo.gl/eN8ffC\n\n#AI #ArtificialIntelligence #DataScientist #IoT #IIoT #BigData #tech #Python #DataScience #Analytics #BigDataAnalytics #Business #tutorial #eLearning @ahmedjr_16,en,0,8,5
...,...,...,...,...,...,...,...
95,https://twitter.com/FusionBIM/status/980582574547456001,2018-04-01 23:07:58+00:00,The latest The fusionBIM Daily! paper.li/FusionBIM/1504… #innovation,en,0,1,1
96,https://twitter.com/grattongirl/status/980582506046148608,2018-04-01 23:07:42+00:00,What are some application use cases for #IoT?#BigData #SmartCity #fintech #AI #ML #startups #innovation @Fisher85M #Sensors #IIoT #Healthcare https://t.co/IccFT37tYe,en,0,7,5
97,https://twitter.com/climatebabes/status/980582485124943872,2018-04-01 23:07:37+00:00,Why not a moveable solar restaurant? #france #innovation #solar youtube.com/watch?v=XnghjZ…,en,0,1,1
98,https://twitter.com/agarwalm/status/980582444930945024,2018-04-01 23:07:27+00:00,The latest The Innovation Daily! paper.li/e-1499468771?e… #innovation #technology,en,0,1,1


---