# Data Collection

To build meaningful models later, the first step is to collect the necessary raw data and assemble a dataset. To do so, tweets are collected from Twitter and stored in a list.

#### 1. Collect data

To collect only relevant, industry-specific tweets, the web scraper is given the following search parameters:
- hashtags: #technology, #tech, #innovation
- language: en
- period: until:2023-03-31 since:2022-10-01
- limit: 1_000_000
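
The search parameters above end up in a single URL-encoded query string. A minimal sketch of how such a string could be assembled with the standard library follows. The helper `build_query` is hypothetical and not part of the scraper module, and treating `e-nativeretweets=on` as a filter that excludes native retweets is an assumption about the search front end.

```python
from urllib.parse import quote_plus


def build_query(hashtags, language, exclude_retweets=True):
    """Build a URL-encoded search query (hypothetical helper, for illustration)."""
    # Combine the hashtags with OR so that a tweet matching any of them qualifies.
    terms = " OR ".join(f"#{tag}" for tag in hashtags)
    # quote_plus encodes '(', ')', '#', ':' and turns spaces into '+'.
    query = quote_plus(f"({terms}) lang:{language}")
    if exclude_retweets:
        # Assumption: this parameter tells the search front end to drop native retweets.
        query += "&e-nativeretweets=on"
    return query


query = build_query(["technology", "tech", "innovation"], "en")
print(query)
```

Building the string from a list keeps the hashtags readable and makes it easy to adjust the search terms without editing percent-encoded text by hand.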

In [1]:
%%script false
from datetime import datetime

from src.data.nitter_scraper_standalone_v2 import scrape

# set search parameters
# URL-encoded search string: (#technology OR #tech OR #innovation) lang:en
query = '%28%23technology+OR+%23tech+OR+%23innovation%29+lang%3Aen&e-nativeretweets=on'
since = datetime(2022, 10, 1)
until = datetime(2023, 3, 31)
limit = 1_000_000

# collect twitter posts
scrape(q=query, since=since, until=until, limit=limit)

Couldn't find program: 'false'


#### 2. Create the dataset

Once the tweets have been collected, they are transformed into a DataFrame.

In [2]:
from src.data.nitter_scraper_standalone_v2 import Tweet, TweetScraper
import pandas as pd

# load tweet objects
list_of_tweets = TweetScraper.load_collected_tweets(path='../data/raw/twitter_tweets_raw.pkl')

# transform into a dataframe
tweet_records = [{'url': tweet.url, 'date': tweet.date, 'rawContent': tweet.rawContent} for tweet in list_of_tweets]
df = pd.DataFrame(tweet_records)

# save raw dataframe
df.to_feather('../data/raw/twitter_tweets_raw.feather')

#### 3. Display the dataset

In [3]:
# set the maximum width of the columns to unlimited
pd.set_option('display.max_colwidth', None)

df.head(6)

| | url | date | rawContent |
|---|---|---|---|
| 0 | https://twitter.com/_Bob_S/status/1641591105082003457#m | Mar 30, 2023 · 11:59 PM UTC | Govt IT security isnt 'a nice thing to do': it is a mandate. USERS should NOT BE INSTALLING any app/prgm on govt owned hardware. Period.\nNot a NEW CONCEPT. From 2014:\ncongress.gov/bill/113th-cong…\nIT configures hardware. THAT is how it should stay. #technology |
| 1 | https://twitter.com/WYSIWYGVentures/status/1641591100027568129#m | Mar 30, 2023 · 11:59 PM UTC | ISC West 2023: Cyberattackers Are Targeting Physical Assets ift.tt/37S5PWf #Technology #Business #SMB #SmallBusiness |
| 2 | https://twitter.com/HackerAran7/status/1641591062329438209#m | Mar 30, 2023 · 11:59 PM UTC | What’s the hack. #stem #science #stemeducation #education #engineering #steam #technology #womeninstem #coding #robotics #tech #learning #edtech #math #research #stemforkids #chemistry #biology #kids #programming #school #womeninscience #scientist #stemgirls #physics #innovation |
| 3 | https://twitter.com/bytefeedai/status/1641590978053128194#m | Mar 30, 2023 · 11:59 PM UTC | BuzzFeed Is Using AI To Write SEO-Bait Travel Guides - bytefeed.ai/the-verge/buzzfe…\n#Technology #The Verge |
| 4 | https://twitter.com/bytefeedai/status/1641590971786997761#m | Mar 30, 2023 · 11:59 PM UTC | Tech Leaders Sign Open Letter Calling for Pause on AI Development - bytefeed.ai/rolling-stone/te…\n#Technology #Rolling Stone |
| 5 | https://twitter.com/bytefeedai/status/1641590971094691841#m | Mar 30, 2023 · 11:59 PM UTC | The View' Host Warns 'Everyone Should Be Scared' Of Artificial Intelligence - bytefeed.ai/fox-news/the-vie…\n#Technology #Fox News |
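
The `date` column in the preview holds display strings such as `Mar 30, 2023 · 11:59 PM UTC` rather than timestamps. As a minimal sketch, assuming every value follows exactly this pattern and an English locale is active, a single string could be parsed with the standard library as shown here. The names `TWEET_DATE_FORMAT` and `parse_tweet_date` are hypothetical and do not come from the scraper module.

```python
from datetime import datetime, timezone

# Assumed display format of the scraped dates, e.g. "Mar 30, 2023 · 11:59 PM UTC".
TWEET_DATE_FORMAT = "%b %d, %Y · %I:%M %p UTC"


def parse_tweet_date(text: str) -> datetime:
    """Parse a scraped date string into a timezone-aware UTC datetime."""
    # strptime treats the trailing "UTC" as literal text, so the timezone is attached explicitly.
    return datetime.strptime(text, TWEET_DATE_FORMAT).replace(tzinfo=timezone.utc)


print(parse_tweet_date("Mar 30, 2023 · 11:59 PM UTC").isoformat())
# → 2023-03-30T23:59:00+00:00
```

Applied to the whole column, for example via `df['date'].map(parse_tweet_date)`, this would yield values that can be sorted and filtered by time.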


---