# Cloud Vendors and related Tweets 
(https://www.kaggle.com/datasets/bwandowando/cloud-vendors-and-related-tweets-dataset)

In this notebook we analyze tweets about the 3 major cloud vendors per market region (EMEA, North America, LATAM, APAC) to find the most popular cloud provider, both by number of tweets and by the overall reach of those tweets on Twitter. Additionally, we perform topic modelling to identify the major topics that users tweet about per cloud provider in the North America region.

### Data Loading and Initial Exploration

In [1]:
#Load the dataset
import pandas as pd

df = pd.read_csv(r'Tweets_on_cloud_services_data\0423_to_0619_CloudProvidersTweets.csv')

In [2]:
df.shape[0] #Our initial data contain 1.6M rows

1625820

In [3]:
df.head(10) 

Unnamed: 0.1,Unnamed: 0,hashed_userid,masked_username,location,following,followers,totaltweets,usercreatedts,tweetid,tweetcreatedts,...,text,hashtags,language,favorite_count,is_retweet,original_tweet_id,in_reply_to_status_id,is_quote_status,quoted_status_id,extractedts
0,0,60387206788473740931,****enni,,370,1516,40930,2010-02-25 08:10:33.000000,1517992721239183362,2022-04-23 22:24:02.000000,...,Migrate your Classic Storage Account to Azure ...,"[{'text': 'Azure', 'indices': [104, 110]}, {'t...",en,0,False,0,0,False,0,2022-04-26 12:03:25.181694
1,1,11757760862330230620,*******angelo,Michigan,633,302,3793,2009-06-14 16:01:18.000000,1517993383700148227,2022-04-23 22:26:40.000000,...,Does anyone in #Blazor land have any successfu...,"[{'text': 'Blazor', 'indices': [15, 22]}, {'te...",en,1,False,0,0,False,0,2022-04-26 12:03:25.168930
2,2,67542035055337764482,********nce_jbs,,1,222,8786,2021-12-07 08:56:09.000000,1517994729950093313,2022-04-23 22:32:01.000000,...,Trane Technologies is looking for a Sr. Data S...,"[{'text': 'careers', 'indices': [146, 154]}, {...",en,1,False,0,0,False,0,2022-04-26 12:03:25.156006
3,3,53508755950760688283,*****ggets,We go where you go.,340,32212,28262,2009-02-09 16:25:45.000000,1517995750000664577,2022-04-23 22:36:04.000000,...,"Azure was built for developers, so it's no sur...","[{'text': 'ITexpert', 'indices': [165, 174]}, ...",en,2,False,0,0,False,0,2022-04-26 12:03:25.143810
4,4,60615872602451730756,*******affairs,Africa,913,1193,69889,2009-08-24 23:37:20.000000,1517995753372794880,2022-04-23 22:36:05.000000,...,Now playing on African Affairs Radio: Dangote ...,"[{'text': 'cloudcomputing', 'indices': [122, 1...",en,1,False,0,0,True,1361781471242039296,2022-04-26 12:03:25.131613
5,5,62142697060620806394,******s_feed,,1,170,10259,2021-11-23 23:00:44.000000,1517997244858961920,2022-04-23 22:42:00.000000,...,Guardian Life Insurance Company is looking for...,"[{'text': 'programming', 'indices': [149, 161]...",en,1,False,0,0,False,0,2022-04-26 12:03:25.119380
6,6,86851226483299649418,******ec_med,"Amsterdam, NL",688,907,23749,2015-03-24 07:48:50.000000,1517998046700789760,2022-04-23 22:45:12.000000,...,Deploying Azure Network Manager with Azure Bic...,"[{'text': 'azure', 'indices': [68, 74]}, {'tex...",en,0,False,0,0,False,0,2022-04-26 12:03:25.106191
7,7,88834411414998048685,******kup4U,"Toronto, Ontario",611,816,20539,2010-04-26 19:29:10.000000,1517998414717403137,2022-04-23 22:46:39.000000,...,Senior Architect - Permanent/Remote anywhere i...,"[{'text': 'softwaredesign', 'indices': [182, 1...",en,1,False,0,0,False,0,2022-04-26 12:03:25.093500
8,8,42390035568972706408,*******EZJR01,"Born in SP, lives in Brasilia",988,1641,62682,2018-12-30 19:27:21.000000,1517998775322791936,2022-04-23 22:48:05.000000,...,Watch Out! #Cryptocurrency Miners Targeting Do...,"[{'text': 'Cryptocurrency', 'indices': [11, 26...",en,1,False,0,0,False,0,2022-04-26 12:03:25.081036
9,9,41108708906982800530,****oper,,253,125,8320,2021-07-19 22:12:56.000000,1517998940737712132,2022-04-23 22:48:45.000000,...,"@lakersucklolz Also, this ratio does too\n\nDa...","[{'text': '100DaysOfCode', 'indices': [52, 66]...",en,9,False,0,1517998519575060480,False,0,2022-04-26 12:03:25.067920


Some columns are redundant and can be safely dropped without losing any important information.

In [4]:
df['language'].value_counts() # The language column is 'en' for every row, so it adds no useful information

en    1625820
Name: language, dtype: int64

On the one hand, every retweet has an original_tweet_id different from 0:

In [5]:
df.loc[df['is_retweet']==True].shape

(745579, 21)

In [6]:
df.loc[(df['is_retweet']==True) & (df['original_tweet_id']!=0)].shape[0]

745579

On the other hand, every original tweet has an original_tweet_id equal to 0:

In [7]:
df.loc[df['is_retweet']==False].shape

(880241, 21)

In [8]:
df.loc[(df['is_retweet']==False) & (df['original_tweet_id']==0)].shape[0]

880241
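
The two pairs of counts above can also be checked in a single expression. The following is a minimal sketch on a small hypothetical sample (the `sample` frame and its values are illustrative assumptions, not rows from the dataset); it tests whether `is_retweet` is fully determined by `original_tweet_id != 0`.

```python
import pandas as pd

# Hypothetical sample with the two columns of interest
sample = pd.DataFrame({
    'is_retweet': [True, False, True, False],
    'original_tweet_id': [151799, 0, 151800, 0],
})

# is_retweet is redundant if it equals (original_tweet_id != 0) on every row
is_redundant = (sample['is_retweet'] == (sample['original_tweet_id'] != 0)).all()
print(is_redundant)
```

Running the same comparison on `df` before dropping the column would confirm the redundancy without comparing row counts by eye.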

Hence the is_retweet column is redundant given the original_tweet_id column and can be safely dropped. We also drop the leftover index column (Unnamed: 0), masked_username, extractedts and the constant language column, since they add nothing to our analysis.

In [9]:
df.drop(['Unnamed: 0', 'masked_username', 'extractedts', 'is_retweet', 'language'], inplace=True, axis=1)

Since it is not clear, to me at least, what the "is_quote_status" column indicates, I am going to keep only the rows where is_quote_status is False and then drop the related columns (in_reply_to_status_id, is_quote_status, quoted_status_id) from the dataset.

In [10]:
df['is_quote_status'].value_counts()

False    1545947
True       79873
Name: is_quote_status, dtype: int64

In [11]:
df = df.loc[df['is_quote_status'] == False].copy() #copy so that later in-place edits do not act on a slice of the original frame

In [12]:
df.drop(['in_reply_to_status_id', 'is_quote_status', 'quoted_status_id'], inplace=True, axis=1)

In [13]:
df.drop_duplicates(inplace=True) #drop any duplicated records

In [14]:
df.shape[0] #After the above pre-processing we have 1.5M rows left

1545947

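We explore the hashtags column to discover whether it can help us in our analysis. Each cell holds a string representation of a list of dictionaries, so we parse it and collect the text of every hashtag.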

In [15]:
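import ast
list_of_hashtags = []
for i in range(df.shape[0]):
    # Report progress every 1000 rows
    if i % 1000 == 0:
        print("Progress made: ", i/df.shape[0]*100, '%')
    # Each cell is a string such as "[{'text': 'Azure', 'indices': [104, 110]}, ...]"
    hashtags = ast.literal_eval(df['hashtags'].iloc[i])
    for tag in hashtags: # 'tag' avoids shadowing the built-in hash()
        list_of_hashtags.append(tag['text'])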

Progress made:  0.0 %
Progress made:  0.0646852705817211 %
Progress made:  0.1293705411634422 %
...

In [16]:
list_of_hashtags_uniques = list(set(list_of_hashtags))

In [17]:
list_of_hashtags_uniques[:10] #Let's print some hashtags

['highefficiencyai',
 'MoCA',
 'GoogleDoc',
 'prolawgue',
 'MicrosoftPartnerCommunity',
 'Happy_Birthday',
 'chartistjs',
 'workstream',
 'educationsector',
 'twixorsolutions']

In [18]:
len(list_of_hashtags_uniques) #Overall, we have 215369 unique hashtags

215369

In [19]:
list_of_hashtags_uniques.sort()

In [20]:
# Lowercase every hashtag; case variants such as 'AWS' and 'aws' remain separate entries
new_list_of_hashtags_uniques = [tag.lower() for tag in list_of_hashtags_uniques]

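Overall, 911 hashtags contain "aws", surpassing the corresponding number for "azure" (672) and for "google" (474). Two caveats apply to these counts: the list was de-duplicated before lowercasing, so case variants of the same hashtag are counted separately, and the check is a substring match, so "aws" would also match unrelated tags such as "laws".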

In [21]:
len([x for x in new_list_of_hashtags_uniques if 'aws' in x])

911

In [22]:
len([x for x in new_list_of_hashtags_uniques if 'azure' in x])

672

In [23]:
len([x for x in new_list_of_hashtags_uniques if 'google' in x])

474
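
The hashtag exploration above can be sketched more compactly. The example below is an illustration under stated assumptions: `sample` is a hypothetical stand-in for the `hashtags` column, and the tags are lowercased before counting so that case variants collapse into one entry (which differs from the cells above, where de-duplication happens first).

```python
import ast
from collections import Counter

import pandas as pd

# Hypothetical sample mimicking the string-encoded `hashtags` column
sample = pd.Series([
    "[{'text': 'Azure', 'indices': [104, 110]}, {'text': 'AWS', 'indices': [0, 4]}]",
    "[{'text': 'aws', 'indices': [5, 9]}, {'text': 'GoogleCloud', 'indices': [10, 22]}]",
    "[]",
])

# Parse each cell, lowercase every tag, then count occurrences
tags = [d['text'].lower() for cell in sample for d in ast.literal_eval(cell)]
tag_counts = Counter(tags)

# Number of distinct (case-insensitive) hashtags containing each vendor keyword
vendor_keywords = ['aws', 'azure', 'google']
distinct_per_vendor = {kw: sum(kw in tag for tag in tag_counts) for kw in vendor_keywords}
print(distinct_per_vendor)
```

Because `tag_counts` also keeps the frequency of each tag, the same structure could be used to rank hashtags by usage rather than only counting distinct spellings.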

We explore the tweet creation timestamp (tweetcreatedts, parsed into the new tweetcreatedts_new column) to discover whether it adds anything meaningful to our analysis.

Given that the time window spans just under two months (23 April 2022 to 19 June 2022), we cannot extract any meaningful trends from it.

In [24]:
df['tweetcreatedts_new'] = pd.to_datetime(df['tweetcreatedts'])

In [25]:
df['tweetcreatedts_date'] = df['tweetcreatedts_new'].dt.date

In [26]:
df['tweetcreatedts_date'].min()

datetime.date(2022, 4, 23)

In [27]:
df['tweetcreatedts_date'].max()

datetime.date(2022, 6, 19)
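
The length of the window follows directly from the two dates above. A minimal sketch, assuming a hypothetical `ts` series whose first and last days match the minimum and maximum shown (the times of day and the middle value are made up for illustration):

```python
import pandas as pd

# Hypothetical timestamps spanning the same first and last day as the dataset
ts = pd.to_datetime(pd.Series([
    '2022-04-23 22:24:02.000000',
    '2022-05-15 10:00:00.000000',
    '2022-06-19 08:30:00.000000',
]))

# Reduce to calendar dates and measure the span in days
dates = ts.dt.date
window_days = (dates.max() - dates.min()).days
print(window_days)
```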

### Feature Engineering

#### Handling NAs 

We check for missing values (NAs) and handle them appropriately.

In [28]:
df.isnull().sum() #The location column contains some null values

hashed_userid               0
location               504529
following                   0
followers                   0
totaltweets                 0
usercreatedts               0
tweetid                     0
tweetcreatedts              0
retweetcount                0
text                        0
hashtags                    0
favorite_count              0
original_tweet_id           0
tweetcreatedts_new          0
tweetcreatedts_date         0
dtype: int64

In [29]:
df['location'] = df['location'].fillna('') #We handle NAs by filling them with the empty string

#### Feature Creation

We want to add each tweet's country of origin. To do so, we first load a dataset that contains the names of most countries, and then search for each country name in the location column of our data.

In [30]:
countries = pd.read_excel(r'Tweets_on_cloud_services_data\countries.xlsx')

In [31]:
countries

Unnamed: 0,Country
0,Afghanistan
1,Albania
2,Algeria
3,Andorra
4,Angola
...,...
190,Venezuela
191,Vietnam
192,Yemen
193,Zambia


We traverse the "location" column of our dataset and check whether each value contains the name of a country from the list; if it does, we take that country as the country of origin for the tweet.

In [32]:
list_of_countries = list(countries['Country'])
country = []
for i in range(df.shape[0]):
    flag = 'not Resolved'
    # An empty location (originally NA) cannot be resolved
    if df['location'].iloc[i] == '':
        country.append('Cannot resolve country')
        continue
    # Periodic report: share of rows processed and share of rows resolved so far
    if i % 1000 == 0 and i != 0:
        print('Completed ', i / df.shape[0] * 100, '% of the dataset')
        print('Amount of resolved countries: ', (len(country) - country.count('Cannot resolve country')) / len(country) * 100 ) 
    try:
        # Case-sensitive substring search; the first country in list order that matches wins
        for cntr in list_of_countries:
            if cntr in df['location'].iloc[i]:
                country.append(cntr)
                flag = 'Resolved'
                break
        if flag != 'Resolved':
            country.append('Cannot resolve country')
    except Exception as e:
        print(e)
        country.append('Cannot resolve country')

Completed  0.09602292259208119 % of the dataset
Amount of resolved countries:  22.5
Completed  0.19204584518416237 % of the dataset
Amount of resolved countries:  21.85
...

In [33]:
len(country) #Verify the length of the country list so we can create a new column in our data

1545947

In [34]:
df['country'] = country

In [35]:
df.loc[df['country']!='Cannot resolve country'].shape #Overall, we successfully identified the country of origin for 315711 rows, or about 20% of our data

(315711, 16)

In [36]:
df_with_countries = df.loc[df['country']!='Cannot resolve country']

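As we can see below, the country identification mechanism resolves the following records sensibly. Note that a location naming several countries, such as "US, Singapore and India", is assigned to the first country in the list that appears in the string (here India).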

In [37]:
df_with_countries[['location', 'country']][:20] 

Unnamed: 0,location,country
20,"Islamabad, Pakistan",Pakistan
22,"Islamabad, Pakistan",Pakistan
48,Canada,Canada
52,"US, Singapore and India",India
70,"Waterloo, Ontario Canada",Canada
71,"Brussels, Belgium",Belgium
72,New Zealand,New Zealand
73,Thailand,Thailand
74,Philippines,Philippines
80,"Manila, Philippines",Philippines
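
The row-by-row search above can also be expressed as a single vectorised pass. The sketch below is an alternative under stated assumptions, not the notebook's method: `country_names` and `locations` are small hypothetical samples, and the match uses regex word boundaries so that "India" does not match inside "Indiana". Unlike the loop, it returns the leftmost country mentioned in the string rather than the first one in list order.

```python
import re

import pandas as pd

# Hypothetical subset of the country list and of the location column
country_names = ['Canada', 'Guinea', 'Guinea-Bissau', 'India', 'Pakistan']
locations = pd.Series([
    'Islamabad, Pakistan',
    'Waterloo, Ontario Canada',
    'Indianapolis, Indiana',
    'Bissau, Guinea-Bissau',
    '',
])

# Longest names first so that 'Guinea-Bissau' is tried before 'Guinea'
alternatives = sorted(country_names, key=len, reverse=True)
pattern = r'\b(' + '|'.join(re.escape(name) for name in alternatives) + r')\b'

# One vectorised pass; rows without a match fall back to the notebook's placeholder
resolved = locations.str.extract(pattern, expand=False).fillna('Cannot resolve country')
print(resolved.tolist())
```

Word boundaries trade recall for precision: they avoid partial-word matches, but they still cannot tell the country Georgia from the US state, so a manual review step remains useful.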


We are going to manually check the remaining records whose country of origin we failed to identify. We want to resolve the country for all locations that appear more than 1000 times in our data.

In [38]:
df.loc[df['country']=='Cannot resolve country', 'location'].value_counts().to_clipboard() 

Based on our manual inspection of the data, we assign the following locations to the corresponding countries.

In [39]:
usa_places_list = ['United States', 'Los Angeles, CA', 'New York, NY', 'Detroit, MI', 'San Francisco, CA', 'USA', 'Miami, FL', 'Chicago, IL', 'New York, USA', 'Washington, DC', "Town 'n' Country, FL", 'Austin, TX', 'Texas, USA', 'Seattle', 'California, USA', 'Tampa, FL', 'Las Vegas, NV', 'Seattle, WA', 'Boston, MA', 'Florida, USA', 'New York', 'Maryland, USA', 'Atlanta, GA', 'Pittsburgh, PA', 'Dallas, TX', 'Houston, TX', 'New Jersey, USA', 'San Francisco', 'San Jose, CA', 'Sacramento, CA', 'North Carolina, USA', 'Washington, USA', 'Troy, MI, US, 48083', 'Miami, usa', 'Florida', 'Texas', 'Palo Alto, CA', 'Titusville, FL', 'San Diego, CA', 'Alpharetta, GA', 'Lakeville, MN']

In [40]:
import swifter 

df['country_new'] = df.swifter.apply(lambda x: 'United States of America' if x['location'] in usa_places_list else x['country'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [41]:
df['country_new'] = df.swifter.apply(lambda x: 'Brazil' if x['location'] in ['Rio de janeiro'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [42]:
df['country_new'] = df.swifter.apply(lambda x: 'Argentina' if x['location'] in ['Buenos Aires'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [43]:
df['country_new'] = df.swifter.apply(lambda x: 'United Kingdom' if x['location'] in ['London', 'London, UK', 'UK', 'Bradford, Yorkshire', 'Carlisle, England', 'Hounslow, London', 'London, England'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [44]:
# 'Karachi' is left out of this list because it is a city in Pakistan, not India
df['country_new'] = df.swifter.apply(lambda x: 'India' if x['location'] in ['Delhi', 'INDIA', 'Mumbai', 'Bangalore', 'New Delhi'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [45]:
df['country_new'] = df.swifter.apply(lambda x: 'Germany' if x['location'] in ['Mysore  and  BERLIN', 'Hamburg'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [46]:
df['country_new'] = df.swifter.apply(lambda x: 'Canada' if x['location'] in ['Toronto, Ontario', 'Toronto'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [47]:
df['country_new'] = df.swifter.apply(lambda x: 'Mexico' if x['location'] in ['Cancún', 'México'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [48]:
df['country_new'] = df.swifter.apply(lambda x: 'Australia' if x['location'] in ['Melbourne, Victoria'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [49]:
df['country_new'] = df.swifter.apply(lambda x: 'Bosnia and Herzegovina' if x['location'] in ['Sarajevo'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [50]:
df['country_new'] = df.swifter.apply(lambda x: 'India' if x['location'] in ['Chennai'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [51]:
df['country_new'] = df.swifter.apply(lambda x: 'France' if x['location'] in ['Paris'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [52]:
# Karachi is a city in Pakistan, so it is assigned here
df['country_new'] = df.swifter.apply(lambda x: 'Pakistan' if x['location'] in ['Islamabad', 'Karachi'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [53]:
df['country_new'] = df.swifter.apply(lambda x: 'Spain' if x['location'] in ['Madrid'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]

In [54]:
df['country_new'] = df.swifter.apply(lambda x: 'Colombia' if x['location'] in ['Envigado, antioquia'] else x['country_new'], axis=1)

Pandas Apply:   0%|          | 0/1545947 [00:00<?, ?it/s]
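
Cells 40 to 54 repeat the same pattern once per country. As a hedged alternative, the manual assignments could be kept in a single dictionary and applied in one vectorised step. The sketch below uses a toy frame and a hypothetical `location_to_country` dictionary holding only a few of the assignments listed above:

```python
import pandas as pd

# Hypothetical subset of the manual assignments; extend it with the lists used above.
location_to_country = {
    'Seattle, WA': 'United States of America',
    'Toronto': 'Canada',
    'Madrid': 'Spain',
}

toy = pd.DataFrame({
    'location': ['Seattle, WA', 'Toronto', 'Atlantis', None],
    'country': ['Cannot resolve country'] * 4,
})

# map() returns NaN for locations without a manual assignment,
# so fillna() falls back to the previously resolved country.
toy['country_new'] = toy['location'].map(location_to_country).fillna(toy['country'])
print(toy)
```

Because the lookup is a plain dictionary, a new location only needs one more entry rather than another pass over the full data frame.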

In [55]:
df.loc[df['country_new']=='Cannot resolve country', 'location'].value_counts().to_clipboard()

After the manual inspection we were able to identify the country for 561289 records. Manually inspecting the data therefore resolved the country for 77% more records than the automatic detection alone.

In [56]:
df.loc[df['country_new']!='Cannot resolve country'].shape 

(561289, 17)

We are going to work only with the records whose country we were able to resolve.

In [57]:
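# .copy() gives an independent frame, so adding columns later does not raise SettingWithCopyWarning
df_with_countries = df.loc[df['country_new']!='Cannot resolve country'].copy()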

In [58]:
df_with_countries.shape

(561289, 17)

#### Natural Language Processing

We are going to clean the text column so that we can later perform topic modelling on it. To do so, we lower-case the text, drop every token that starts with a special character (for example mentions and hashtags), drop URLs, collapse repeated spaces, replace punctuation and any other non-letter character with a space, and discard tweets that still contain non-ASCII characters. Since we want to keep only tweets in English, we then use the fastText language identification model to detect the language of each tweet and keep those predicted as English with a confidence above 0.7. Finally, we remove stopwords and lemmatize the text.

In [59]:
import re
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
import fasttext

model = fasttext.load_model(r'Tweets_on_cloud_services_data/lid.176.ftz')
st = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# lower-case the text and trim surrounding whitespace
df_with_countries['clean_text'] = df_with_countries['text'].swifter.apply(lambda x: x.lower().strip())

# drop tokens that start with a special character (mentions, hashtags, etc.)
df_with_countries['clean_text'] = df_with_countries['clean_text'].swifter.apply(lambda x: ' '.join(text for text in x.split() if text[0] not in ['!', '@', '#', '$', '%', '^', '&', '*']))

# drop URLs
df_with_countries['clean_text'] = df_with_countries['clean_text'].swifter.apply(lambda x: ' '.join(text for text in x.split() if text[:4] != 'http'))

# remove extra spaces in between
df_with_countries['clean_text'] = df_with_countries['clean_text'].swifter.apply(lambda x: re.sub(' +', ' ', x))

# remove punctuation
df_with_countries['clean_text'] = df_with_countries['clean_text'].swifter.apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))

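# keep only rows whose cleaned text is pure ASCII
# (reset_index keeps the old index as an 'index' column, which is dropped later)
df_with_countries.reset_index(inplace=True)
indices = []
for i in range(df_with_countries.shape[0]):
    if df_with_countries['clean_text'].iloc[i].isascii():
        indices.append(i)

df_with_countries = df_with_countries.loc[df_with_countries.index[indices]]

# keep only tweets that fastText predicts as English with a confidence above 0.7
# (labels look like '__label__en', so [9:] strips the '__label__' prefix;
# this second reset_index adds the 'level_0' column, also dropped later)
df_with_countries.reset_index(inplace=True)
indices = []
for i in range(df_with_countries.shape[0]):
    lang_pred = model.predict(df_with_countries['clean_text'].iloc[i], k=1)
    if lang_pred[1][0] > 0.7 and lang_pred[0][0][9:] == 'en':
        indices.append(i)

df_with_countries = df_with_countries.loc[df_with_countries.index[indices]]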

# remove stopwords
df_with_countries['clean_text'] = df_with_countries['clean_text'].swifter.apply(lambda x: ' '.join(text for text in x.split() if text not in stop_words))

# lemmatize each remaining token
df_with_countries['clean_text'] = df_with_countries['clean_text'].swifter.apply(lambda x: ' '.join(lemmatizer.lemmatize(text) for text in x.split()))



Pandas Apply:   0%|          | 0/561289 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_with_countries['clean_text'] = df_with_countries['text'].swifter.apply(lambda x: x.lower().strip())




Pandas Apply:   0%|          | 0/452977 [00:00<?, ?it/s]

Pandas Apply:   0%|          | 0/452977 [00:00<?, ?it/s]
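
To make the effect of the cleaning steps easier to see, here is a minimal sketch that applies the token-level steps to a single string. It is an illustration only: `clean_tweet` is a hypothetical helper, the stop-word set is a tiny stand-in for NLTK's English list, and the language detection and lemmatization steps are left out.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {'your', 'to', 'the', 'a', 'for'}
SPECIAL_PREFIXES = ('!', '@', '#', '$', '%', '^', '&', '*')

def clean_tweet(text):
    # lower-case and split into whitespace-separated tokens
    tokens = text.lower().strip().split()
    # drop tokens starting with a special character (mentions, hashtags, ...)
    tokens = [t for t in tokens if not t.startswith(SPECIAL_PREFIXES)]
    # drop URLs
    tokens = [t for t in tokens if not t.startswith('http')]
    # replace every non-letter character with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', ' '.join(tokens))
    # split() also collapses the extra spaces introduced above
    return ' '.join(t for t in letters_only.split() if t not in STOP_WORDS)

cleaned = clean_tweet('Migrate your Classic Storage Account to #Azure today! https://t.co/abc')
print(cleaned)  # migrate classic storage account today
```

Note that a hashtag such as `#Azure` is dropped entirely rather than reduced to the word `azure`, because the whole token is removed when it starts with `#`.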

#### Divide in Market Regions

We want to add the market region that each country belongs to.

In [60]:
regions = pd.read_excel(r'Tweets_on_cloud_services_data\countries_regions.xlsx')

In [61]:
regions

Unnamed: 0,Country,Region
0,Andorra,EMEA
1,United Arab Emirates,EMEA
2,Afghanistan,EMEA
3,Antigua and Barbuda,LATAM
4,Anguilla,LATAM
...,...,...
242,Yemen,EMEA
243,Mayotte,EMEA
244,South Africa,EMEA
245,Zambia,EMEA


In [62]:
df_with_countries = pd.merge(df_with_countries, regions, left_on = 'country_new', right_on='Country', how='left')

In [63]:
df_with_countries['Region'].value_counts() #Each region contains around the same number of tweets except LATAM

NOAM     145743
EMEA     142573
APAC     141482
LATAM     23179
Name: Region, dtype: int64

In [64]:
df_with_countries['Region'].isnull().sum() #Every country in our data has been assigned to a region

0
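
If the null count were not zero, it would help to know which country names failed to match. One possible way, sketched here on toy frames with made-up values, is the `indicator=True` option of `pd.merge`, which adds a `_merge` column describing where each row came from:

```python
import pandas as pd

# Toy frames; 'Atlantis' is a made-up country with no region entry.
tweets = pd.DataFrame({'country_new': ['Canada', 'Spain', 'Atlantis']})
regions_toy = pd.DataFrame({'Country': ['Canada', 'Spain'], 'Region': ['NOAM', 'EMEA']})

merged = pd.merge(tweets, regions_toy, left_on='country_new', right_on='Country',
                  how='left', indicator=True)

# Rows marked 'left_only' had no matching country in the regions table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'country_new'].unique().tolist()
print(unmatched)
```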

#### Feature Creation

We create 3 new columns, each indicating whether a tweet mentions a particular cloud vendor

In [65]:
df_with_countries['processed_text'] = df_with_countries['text'].apply(lambda x: x.lower().strip())

df_with_countries['is_aws'] = df_with_countries.swifter.apply(lambda x: 1 if 'aws' in x['processed_text'] else 0, axis=1)
df_with_countries['is_azure'] = df_with_countries.swifter.apply(lambda x: 1 if 'azure' in x['processed_text'] else 0, axis=1)
df_with_countries['is_gcp'] = df_with_countries.swifter.apply(lambda x: 1 if 'google' in x['processed_text'] else 0, axis=1)

Pandas Apply:   0%|          | 0/452977 [00:00<?, ?it/s]

Pandas Apply:   0%|          | 0/452977 [00:00<?, ?it/s]

Pandas Apply:   0%|          | 0/452977 [00:00<?, ?it/s]
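
The flags above rely on plain substring matching, so a word such as "laws" also counts as a mention of `aws`, and any mention of `google` counts as Google Cloud. The sketch below contrasts substring matching with a stricter word-boundary match on a few made-up tweets. It is an alternative for comparison, not the approach used in this notebook, and adopting it would change the counts reported below.

```python
import pandas as pd

toy = pd.DataFrame({'text': [
    'New AWS Lambda feature',
    'Strict privacy laws discussed',
    'Azure and AWS compared',
    'I googled it',
]})

# Substring match, equivalent to the `'aws' in text` check used above.
toy['is_aws_substring'] = toy['text'].str.lower().str.contains('aws', regex=False).astype(int)

# \b requires a word boundary on both sides, so 'laws' no longer counts as 'aws'.
toy['is_aws_word'] = toy['text'].str.contains(r'\baws\b', case=False, regex=True).astype(int)
print(toy)
```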

We drop all the columns that we are not going to use in our analysis

In [66]:
df_with_countries.drop(['level_0', 'index', 'location', 'country', 'Country', 'processed_text', 'text', 'hashtags'], inplace=True, axis=1)
df_with_countries.rename(columns={'country_new': 'country'}, inplace=True)

### Cloud Vendors popularity by Market Region - number of tweets

Number of tweets about each cloud vendor per market region

In [67]:
noam_region = df_with_countries.loc[df_with_countries['Region']=='NOAM']
print('Number of tweets about Azure in North America Region: ', noam_region['is_azure'].sum())
print('Number of tweets about AWS in North America Region: ', noam_region['is_aws'].sum())
print('Number of tweets about Google Cloud in North America Region: ', noam_region['is_gcp'].sum())

Number of tweets about Azure in North America Region:  9961
Number of tweets about AWS in North America Region:  5197
Number of tweets about Google Cloud in North America Region:  2381


In [68]:
apac_region = df_with_countries.loc[df_with_countries['Region']=='APAC']
print('Number of tweets about Azure in Asia Pacific Region: ', apac_region['is_azure'].sum())
print('Number of tweets about AWS in Asia Pacific Region: ', apac_region['is_aws'].sum())
print('Number of tweets about Google Cloud in Asia Pacific Region: ', apac_region['is_gcp'].sum())

Number of tweets about Azure in Asia Pacific Region:  7247
Number of tweets about AWS in Asia Pacific Region:  8812
Number of tweets about Google Cloud in Asia Pacific Region:  3460


In [69]:
emea_region = df_with_countries.loc[df_with_countries['Region']=='EMEA']
print('Number of tweets about Azure in EMEA Region: ', emea_region['is_azure'].sum())
print('Number of tweets about AWS in EMEA Region: ', emea_region['is_aws'].sum())
print('Number of tweets about Google Cloud in EMEA Region: ', emea_region['is_gcp'].sum())

Number of tweets about Azure in EMEA Region:  7423
Number of tweets about AWS in EMEA Region:  5963
Number of tweets about Google Cloud in EMEA Region:  2538


In [70]:
latam_region = df_with_countries.loc[df_with_countries['Region']=='LATAM']
print('Number of tweets about Azure in Latin America Region: ', latam_region['is_azure'].sum())
print('Number of tweets about AWS in Latin America Region: ', latam_region['is_aws'].sum())
print('Number of tweets about Google Cloud in Latin America Region: ', latam_region['is_gcp'].sum())

Number of tweets about Azure in Latin America Region:  3404
Number of tweets about AWS in Latin America Region:  2576
Number of tweets about Google Cloud in Latin America Region:  322
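
The four cells above differ only in the region they filter on. As a compact alternative, a single `groupby` over `Region` returns the same kind of table in one step. A minimal sketch on a toy frame with made-up flags:

```python
import pandas as pd

toy = pd.DataFrame({
    'Region':   ['NOAM', 'NOAM', 'APAC', 'APAC', 'EMEA'],
    'is_azure': [1, 0, 0, 1, 1],
    'is_aws':   [0, 1, 1, 1, 0],
    'is_gcp':   [0, 0, 0, 0, 1],
})

vendor_cols = ['is_azure', 'is_aws', 'is_gcp']

# Summing the 0/1 flags per region gives the number of tweets per vendor.
counts_by_region = toy.groupby('Region')[vendor_cols].sum()
print(counts_by_region)
```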


Share of tweets about each cloud vendor, as a percentage of all tweets about cloud vendors in the region

In [71]:
noam_region = df_with_countries.loc[df_with_countries['Region']=='NOAM']

count_of_tweets = (noam_region['is_azure'].sum() + noam_region['is_aws'].sum() + noam_region['is_gcp'].sum())

print('Ratio of tweets about Azure in North America Region: ', round(noam_region['is_azure'].sum() / count_of_tweets * 100, 2))
print('Ratio of tweets about AWS in North America Region: ', round(noam_region['is_aws'].sum() / count_of_tweets * 100, 2))
print('Ratio of tweets about Google Cloud in North America Region: ', round(noam_region['is_gcp'].sum() / count_of_tweets * 100, 2))

Ratio of tweets about Azure in North America Region:  56.79
Ratio of tweets about AWS in North America Region:  29.63
Ratio of tweets about Google Cloud in North America Region:  13.58


In [72]:
apac_region = df_with_countries.loc[df_with_countries['Region']=='APAC']

count_of_tweets = (apac_region['is_azure'].sum() + apac_region['is_aws'].sum() + apac_region['is_gcp'].sum())

print('Ratio of tweets about Azure in Asia Pacific Region: ', round(apac_region['is_azure'].sum() / count_of_tweets * 100, 2))
print('Ratio of tweets about AWS in Asia Pacific Region: ', round(apac_region['is_aws'].sum() / count_of_tweets * 100, 2))
print('Ratio of tweets about Google Cloud in Asia Pacific Region: ', round(apac_region['is_gcp'].sum() / count_of_tweets * 100, 2))

Ratio of tweets about Azure in Asia Pacific Region:  37.13
Ratio of tweets about AWS in Asia Pacific Region:  45.15
Ratio of tweets about Google Cloud in Asia Pacific Region:  17.73


In [73]:
emea_region = df_with_countries.loc[df_with_countries['Region']=='EMEA']

count_of_tweets = (emea_region['is_azure'].sum() + emea_region['is_aws'].sum() + emea_region['is_gcp'].sum())

print('Ratio of tweets about Azure in EMEA Region: ', round(emea_region['is_azure'].sum() / count_of_tweets * 100, 2))
print('Ratio of tweets about AWS in EMEA Region: ', round(emea_region['is_aws'].sum() / count_of_tweets * 100, 2))
print('Ratio of tweets about Google Cloud in EMEA Region: ', round(emea_region['is_gcp'].sum() / count_of_tweets * 100, 2))

Ratio of tweets about Azure in EMEA Region:  46.62
Ratio of tweets about AWS in EMEA Region:  37.45
Ratio of tweets about Google Cloud in EMEA Region:  15.94


In [74]:
latam_region = df_with_countries.loc[df_with_countries['Region']=='LATAM']

count_of_tweets = (latam_region['is_azure'].sum() + latam_region['is_aws'].sum() + latam_region['is_gcp'].sum())

print('Ratio of tweets about Azure in Latin America Region: ', round(latam_region['is_azure'].sum() / count_of_tweets * 100, 2))
print('Ratio of tweets about AWS in Latin America Region: ', round(latam_region['is_aws'].sum() / count_of_tweets * 100, 2))
print('Ratio of tweets about Google Cloud in Latin America Region: ', round(latam_region['is_gcp'].sum() / count_of_tweets * 100, 2))

Ratio of tweets about Azure in Latin America Region:  54.01
Ratio of tweets about AWS in Latin America Region:  40.88
Ratio of tweets about Google Cloud in Latin America Region:  5.11
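
The percentages above can be reproduced from the tweet counts printed earlier. The sketch below puts those counts in one table and divides each row by its own total. Note that the denominator is the sum of the three flag totals, so a tweet mentioning more than one vendor is counted once per vendor.

```python
import pandas as pd

# Tweet counts per vendor as printed by the cells above.
counts = pd.DataFrame(
    {'Azure': [9961, 7247, 7423, 3404],
     'AWS': [5197, 8812, 5963, 2576],
     'Google Cloud': [2381, 3460, 2538, 322]},
    index=['NOAM', 'APAC', 'EMEA', 'LATAM'],
)

# Divide each row by its own total to get the percentage share per region.
shares = counts.div(counts.sum(axis=1), axis=0).mul(100).round(2)
print(shares)
```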


### Cloud Vendors popularity by Market Region - tweet visibility

Although the number of tweets is a good indicator of each cloud vendor's popularity, some tweets are much more visible than others. We estimate a tweet's visibility from 3 key drivers: 1. the number of people following the user who posted the tweet, 2. the number of times the tweet has been retweeted, 3. the number of people who have liked the tweet. The higher each number, the higher the tweet's visibility. The three drivers do not contribute equally, though. A tweet with 100 likes has received much more visibility than a tweet posted by a user with 100 followers, since collecting 100 likes means the tweet was viewed by many more than 100 people. Likewise, a tweet with 100 retweets has received much more visibility than a tweet with 100 likes, since a user is more likely to like a tweet than to retweet it to their network. Although the choice is subjective, we assign a visibility weight of 0.6 per retweet, 0.3 per like and 0.1 per follower.

In [75]:
def reach_score(x, y, z):
    return 0.6 * x + 0.3 * y + 0.1 * z

In [76]:
df_with_countries['reach'] = df_with_countries.swifter.apply(lambda x: reach_score(x['retweetcount'], x['favorite_count'], x['followers']), axis=1)

In [77]:
noam_region = df_with_countries.loc[df_with_countries['Region']=='NOAM']

reach_of_tweets = (noam_region.loc[noam_region['is_azure']==1, 'reach'].sum() + noam_region.loc[noam_region['is_aws']==1, 'reach'].sum() + noam_region.loc[noam_region['is_gcp']==1, 'reach'].sum())

print('Reach of tweets about Azure in North America Region: ', round(noam_region.loc[noam_region['is_azure']==1, 'reach'].sum() / reach_of_tweets * 100, 2))
print('Reach of tweets about AWS in North America Region: ', round(noam_region.loc[noam_region['is_aws']==1, 'reach'].sum() / reach_of_tweets * 100, 2))
print('Reach of tweets about Google Cloud in North America Region: ', round(noam_region.loc[noam_region['is_gcp']==1, 'reach'].sum() / reach_of_tweets * 100, 2))

Reach of tweets about Azure in North America Region:  15.86
Reach of tweets about AWS in North America Region:  71.42
Reach of tweets about Google Cloud in North America Region:  12.72


In [78]:
apac_region = df_with_countries.loc[df_with_countries['Region']=='APAC']

reach_of_tweets = (apac_region.loc[apac_region['is_azure']==1, 'reach'].sum() + apac_region.loc[apac_region['is_aws']==1, 'reach'].sum() + apac_region.loc[apac_region['is_gcp']==1, 'reach'].sum())

print('Reach of tweets about Azure in Asia Pacific Region: ', round(apac_region.loc[apac_region['is_azure']==1, 'reach'].sum() / reach_of_tweets * 100, 2))
print('Reach of tweets about AWS in Asia Pacific Region: ', round(apac_region.loc[apac_region['is_aws']==1, 'reach'].sum() / reach_of_tweets * 100, 2))
print('Reach of tweets about Google Cloud in Asia Pacific Region: ', round(apac_region.loc[apac_region['is_gcp']==1, 'reach'].sum() / reach_of_tweets * 100, 2))

Reach of tweets about Azure in Asia Pacific Region:  33.63
Reach of tweets about AWS in Asia Pacific Region:  40.79
Reach of tweets about Google Cloud in Asia Pacific Region:  25.58


In [79]:
emea_region = df_with_countries.loc[df_with_countries['Region']=='EMEA']

reach_of_tweets = (emea_region.loc[emea_region['is_azure']==1, 'reach'].sum() + emea_region.loc[emea_region['is_aws']==1, 'reach'].sum() + emea_region.loc[emea_region['is_gcp']==1, 'reach'].sum())

print('Reach of tweets about Azure in EMEA Region: ', round(emea_region.loc[emea_region['is_azure']==1, 'reach'].sum() / reach_of_tweets * 100, 2))
print('Reach of tweets about AWS in EMEA Region: ', round(emea_region.loc[emea_region['is_aws']==1, 'reach'].sum() / reach_of_tweets * 100, 2))
print('Reach of tweets about Google Cloud in EMEA Region: ', round(emea_region.loc[emea_region['is_gcp']==1, 'reach'].sum() / reach_of_tweets * 100, 2))

Reach of tweets about Azure in EMEA Region:  35.74
Reach of tweets about AWS in EMEA Region:  38.02
Reach of tweets about Google Cloud in EMEA Region:  26.24


In [80]:
latam_region = df_with_countries.loc[df_with_countries['Region']=='LATAM']

reach_of_tweets = (latam_region.loc[latam_region['is_azure']==1, 'reach'].sum() + latam_region.loc[latam_region['is_aws']==1, 'reach'].sum() + latam_region.loc[latam_region['is_gcp']==1, 'reach'].sum())

print('Reach of tweets about Azure in Latin America Region: ', round(latam_region.loc[latam_region['is_azure']==1, 'reach'].sum() / reach_of_tweets * 100, 2))
print('Reach of tweets about AWS in Latin America Region: ', round(latam_region.loc[latam_region['is_aws']==1, 'reach'].sum() / reach_of_tweets * 100, 2))
print('Reach of tweets about Google Cloud in Latin America Region: ', round(latam_region.loc[latam_region['is_gcp']==1, 'reach'].sum() / reach_of_tweets * 100, 2))

Reach of tweets about Azure in Latin America Region:  55.27
Reach of tweets about AWS in Latin America Region:  37.84
Reach of tweets about Google Cloud in Latin America Region:  6.89
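
The reach shares for all regions can also be computed in a single pass. The sketch below is one possible formulation on a toy frame with made-up values: multiplying each 0/1 vendor flag by `reach` keeps the reach only where the vendor is mentioned, and a `groupby` then sums it per region.

```python
import pandas as pd

toy = pd.DataFrame({
    'Region':   ['NOAM', 'NOAM', 'NOAM', 'APAC'],
    'is_azure': [1, 0, 1, 1],
    'is_aws':   [0, 1, 1, 0],
    'is_gcp':   [0, 0, 0, 1],
    'reach':    [10.0, 70.0, 20.0, 5.0],
})
vendor_cols = ['is_azure', 'is_aws', 'is_gcp']

# Reach per vendor and region: flag * reach, summed within each region.
reach_by_vendor = toy[vendor_cols].mul(toy['reach'], axis=0).groupby(toy['Region']).sum()

# Percentage share of each vendor within its region.
reach_share = reach_by_vendor.div(reach_by_vendor.sum(axis=1), axis=0).mul(100).round(2)
print(reach_share)
```

As in the cells above, a tweet that mentions two vendors contributes its reach to both of them.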


In [81]:
noam_region = df_with_countries.loc[(df_with_countries['Region']=='NOAM') & ((df_with_countries['is_aws']==1) | (df_with_countries['is_azure']==1) | (df_with_countries['is_gcp']==1))]

noam_region_sorted = noam_region.sort_values(by=['reach'], ascending=False)[:200]

print('Out of 200 tweets with highest reach, ', noam_region_sorted['is_aws'].sum(), ' are about AWS')
print('Out of 200 tweets with highest reach, ', noam_region_sorted['is_azure'].sum(), ' are about Azure')
print('Out of 200 tweets with highest reach, ', noam_region_sorted['is_gcp'].sum(), ' are about GCP')

Out of 200 tweets with highest reach,  143  are about AWS
Out of 200 tweets with highest reach,  23  are about Azure
Out of 200 tweets with highest reach,  42  are about GCP


We observe that tweets about AWS have a much greater visibility than those about its competitors, even though AWS was behind Azure in the number of tweets. Note that the three counts above add up to 208 rather than 200, because a tweet that mentions more than one vendor is counted once for each.

Moreover, if we take a closer look at the account that drives the AWS visibility, we see that it was created on 2009-08-18. This is the official AWS account on Twitter, and its visibility is so high because it is followed by about 2M people

In [82]:
df_with_countries.loc[df_with_countries['usercreatedts']=='2009-08-18 19:52:16.000000'][:1]

Unnamed: 0,hashed_userid,following,followers,totaltweets,usercreatedts,tweetid,tweetcreatedts,retweetcount,favorite_count,original_tweet_id,tweetcreatedts_new,tweetcreatedts_date,country,clean_text,Region,is_aws,is_azure,is_gcp,reach
89,3284594507786650753,975,2014973,36321,2009-08-18 19:52:16.000000,1518228625895706624,2022-04-24 14:01:26.000000,39,339,0,2022-04-24 14:01:26,2022-04-24,United States of America,skill frontrunner worthy new deepracer arcade ...,NOAM,1,0,0,201622.4
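
The `reach` value in the row above can be reproduced by hand from its `retweetcount` (39), `favorite_count` (339) and `followers` (2014973). The short sketch below does that arithmetic and also shows how much of the score comes from the followers term. The argument names are chosen here for readability and differ from the `x, y, z` used in the notebook.

```python
def reach_score(retweets, likes, followers):
    # Same weights as the notebook: 0.6 per retweet, 0.3 per like, 0.1 per follower.
    return 0.6 * retweets + 0.3 * likes + 0.1 * followers

total = reach_score(39, 339, 2014973)
followers_term = 0.1 * 2014973
followers_share = followers_term / total

print(round(total, 1))                  # 201622.4
print(round(followers_share * 100, 2))  # 99.94
```

Because the weights are applied to raw counts, the follower count accounts for almost all of this account's score, even though it has the smallest weight.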


Breakdown of the top 200 tweets with the highest visibility by Cloud Vendor

In [83]:
noam_region_sorted.loc[noam_region_sorted['is_aws']==1, 'usercreatedts'].value_counts()

2009-08-18 19:52:16.000000    68
2006-12-07 17:03:09.000000    42
2018-02-27 20:10:14.000000     7
2012-03-23 16:35:17.000000     6
2009-04-25 12:45:16.000000     4
2007-12-14 16:06:10.000000     3
2008-05-27 20:11:54.000000     2
2017-01-10 19:27:47.000000     2
2009-01-14 17:13:48.000000     1
2008-01-23 22:11:25.000000     1
2007-04-27 18:43:50.000000     1
2009-04-06 16:10:33.000000     1
2008-06-09 23:05:05.000000     1
2011-10-19 13:01:42.000000     1
2008-07-30 15:55:46.000000     1
2008-05-08 21:46:41.000000     1
2008-12-29 21:44:27.000000     1
Name: usercreatedts, dtype: int64

In [84]:
noam_region_sorted.loc[noam_region_sorted['is_azure']==1, 'usercreatedts'].value_counts()

2009-04-25 12:45:16.000000    8
2012-03-23 16:35:17.000000    8
2008-05-27 20:11:54.000000    2
2008-04-03 12:33:18.000000    1
2009-02-27 18:02:56.000000    1
2007-05-08 16:28:38.000000    1
2009-01-14 17:13:48.000000    1
2007-04-27 18:43:50.000000    1
Name: usercreatedts, dtype: int64

In [85]:
noam_region_sorted.loc[noam_region_sorted['is_gcp']==1, 'usercreatedts'].value_counts()

2012-03-23 16:35:17.000000    17
2009-04-25 12:45:16.000000    16
2007-04-14 22:48:21.000000     3
2009-06-25 20:55:34.000000     2
2012-08-16 23:16:47.000000     2
2009-01-02 23:15:02.000000     1
2015-09-24 11:47:59.000000     1
Name: usercreatedts, dtype: int64

### Topic Modeling

Finally, we are going to find out the main topics that users are tweeting about when it comes to Cloud Vendors in the North America region

In [86]:
#Get tweets about AWS
noam_region_aws_cloud = df_with_countries.loc[(df_with_countries['Region']=='NOAM') & (df_with_countries['is_aws']==1)]

In [87]:
noam_region_aws_cloud.shape #We have 5197 tweets about AWS

(5197, 19)

In [88]:
noam_region_aws_cloud_train = noam_region_aws_cloud

Build a corpus dictionary capped at 100000 words, filtering out words that appear in fewer than 15 tweets as well as those that appear in more than 50% of the tweets

In [89]:
from gensim import corpora, models
import gensim
corpus = []
for i in range(noam_region_aws_cloud_train.shape[0]):
    try:    
        corpus.append(noam_region_aws_cloud_train['clean_text'].iloc[i].split())
    except Exception as e:
        print(e)
        print(noam_region_aws_cloud_train['clean_text'].iloc[i], i)
        continue
    
dct = corpora.Dictionary(corpus)

In [90]:
dct.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
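
`filter_extremes` works on document frequency, that is, the number of tweets a word appears in, rather than on total occurrences. The pure-Python sketch below mirrors that rule as we understand it, on a toy corpus with much smaller thresholds. `filter_by_document_frequency` is a hypothetical helper and the `keep_n` cap is left out.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    # Count in how many documents each token appears (not total occurrences).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # no_above is a fraction of the corpus size, converted to a document count.
    max_docs = int(no_above * len(docs))
    return {token for token, n in doc_freq.items() if no_below <= n <= max_docs}

docs = [
    ['aws', 'cloud', 'lambda'],
    ['aws', 'cloud'],
    ['aws', 'security'],
    ['aws', 'cloud', 'security'],
]

kept = filter_by_document_frequency(docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['cloud', 'security']
```

Here `aws` is dropped because it appears in every document, and `lambda` because it appears in only one.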

In [91]:
bow_corpus = [dct.doc2bow(doc) for doc in corpus]

In [92]:
len(bow_corpus)

5197

Tf-idf (term frequency-inverse document frequency) is a metric that weights each word of a tweet according to its importance in that tweet. The idea behind tf-idf is that a word that is rare across all tweets but appears in a particular tweet should be important for that tweet

In [93]:
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
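
To make the weighting concrete, here is a small pure-Python sketch of tf-idf on a toy corpus. It assumes the weighting that gensim's `TfidfModel` uses by default as far as we know (raw term frequency times log2(N / document frequency), followed by L2 normalisation). Treat that as an assumption and check the gensim documentation before relying on exact values. `tfidf_vectors` is a hypothetical helper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    n_docs = len(docs)
    # Number of documents each token appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log2(n_docs / doc_freq[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Tokens present in every document get weight 0 and are dropped.
        vectors.append({t: w / norm for t, w in weights.items() if w > 0} if norm > 0 else {})
    return vectors

docs = [
    ['aws', 'lambda', 'serverless'],
    ['aws', 'security'],
    ['aws', 'security', 'data'],
    ['aws', 'data'],
]

vectors = tfidf_vectors(docs)
print(vectors[0])
```

In this toy corpus `aws` appears in every document, so it receives a weight of zero, while the rarer `lambda` and `serverless` carry all the weight of the first document.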

We use Latent Dirichlet Allocation (LDA) to generate the topics. The output below lists each generated topic alongside the words that mainly drive it

In [94]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=4, id2word=dct, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.010*"service" + 0.009*"cloud" + 0.009*"access" + 0.009*"aws" + 0.009*"today" + 0.009*"want" + 0.009*"available" + 0.008*"inbox" + 0.008*"chat" + 0.008*"social"
Topic: 1 Word: 0.012*"cloud" + 0.012*"aws" + 0.012*"service" + 0.009*"learn" + 0.009*"amazon" + 0.008*"hacking" + 0.007*"free" + 0.007*"support" + 0.006*"instance" + 0.006*"data"
Topic: 2 Word: 0.011*"rt" + 0.010*"follow" + 0.010*"aws" + 0.008*"new" + 0.008*"cloud" + 0.006*"service" + 0.006*"build" + 0.006*"built" + 0.005*"data" + 0.005*"help"
Topic: 3 Word: 0.011*"aws" + 0.008*"security" + 0.007*"data" + 0.007*"cloud" + 0.007*"learn" + 0.006*"learning" + 0.006*"work" + 0.006*"week" + 0.006*"technology" + 0.006*"trying"


To better illustrate the topics that LDA has proposed, we take the first 200 tweets and print each one under its dominant topic, one topic per cell

In [95]:
bow_corpus_test = bow_corpus[:200]

In [102]:
new_df = pd.merge(noam_region_aws_cloud_train, df[['text', 'tweetid']], how='left', on ='tweetid')

In [104]:
topics = []
score = []
sentence = []
counter = 0 
for i in range(len(bow_corpus_test)):
    pred = lda_model_tfidf[bow_corpus_test[i]]    
    maxim = -1
    topic = -1
    for el in pred:        
        if el[1] > maxim:
            maxim = el[1]
            topic = el[0]
    sentence.append(new_df['text'].iloc[i])
    score.append(maxim)
    topics.append(topic)
    if topic == 0:
        counter += 1
        print(new_df['text'].iloc[i], '\n \n ------------------ \n \n')
        #print("Sentence: {}\t \n Score: {}\t \n \n Topic: {}\t \n \n".format(new_df['text'].iloc[i], maxim, topic))
print(counter)

☀️ Don't miss our on getting free access to our platform - No credit card required! 
🎯 Hurry up, just a few days left! Refresh your tech skills and take your career to the next level.
👉 https://t.co/NgLLyMdvUN

#AWS #Azure #GCP #AWSCertified #AzureCertified https://t.co/YmaviWFIgG 
 
 ------------------ 
 

Today's Most popular #Cloud Headlines #Openstack #AWS https://t.co/331xhP1D0S 
 
 ------------------ 
 

Are your #MachineLearning skills frontrunner worthy? 🏁🏎 ☁️

With the new #AWS DeepRacer Arcade, created in collaboration with @F1, you can test your driving skills against the clock. Find an AWS Summit near you, and go for a spin. https://t.co/0na0wRf6lN #Developer #F1 https://t.co/t7aga9LIIF 
 
 ------------------ 
 

Hiring a full-time Senior Python Developer in Nashville, TN. #python #pythondeveloper #django #aws https://t.co/CtFb6rfbKO 
 
 ------------------ 
 

awscloud: Are your #MachineLearning skills frontrunner worthy? 🏁🏎 ☁️

With the new #AWS DeepRacer Arcade, created i

In [105]:
topics = []
score = []
sentence = []
counter = 0
for i in range(len(bow_corpus_test)):
    pred = lda_model_tfidf[bow_corpus_test[i]]    
    maxim = -1
    topic = -1
    for el in pred:        
        if el[1] > maxim:
            maxim = el[1]
            topic = el[0]
    sentence.append(new_df['text'].iloc[i])
    score.append(maxim)
    topics.append(topic)
    if topic == 1:
        print(new_df['text'].iloc[i], '\n \n ------------------ \n \n')
        counter += 1
        #print("Sentence: {}\t \n Score: {}\t \n \n Topic: {}\t \n \n".format(new_df['text'].iloc[i], maxim, topic))
print(counter)

#AWS announces EKS Blueprints,  a collection of Infrastructure as Code (IaC) modules that configure &amp; deploy consistent, batteries-included EKS clusters. With support for Terraform &amp; CDK, partner products &amp; self managed add-ons, its a cool way to build https://t.co/yut1iU8yj4 
 
 ------------------ 
 

#AWS &amp; @Splunk are celebrating 10 years of strategic collaboration! If you want better visibility &amp; control while delivering apps faster, AWS + Splunk are here to help: https://t.co/6mtXmKyljK https://t.co/TISmryilcT 
 
 ------------------ 
 

Secure and affordable cloud-native #backup for #AWS #EC2 instances. Backup 10 EC2 instances for free. Download &amp; Get started now! https://t.co/DHfhlbIRmb
#AWSbackup #EC2instances #dataprotection https://t.co/HS0j03tCD3 
 
 ------------------ 
 

This certification is essential in laying the groundwork for establishing centralized governance and building event-driven security architectures that will help federal agencies like

In [106]:
topics = []
score = []
sentence = []
counter = 0
for i in range(len(bow_corpus_test)):
    pred = lda_model_tfidf[bow_corpus_test[i]]    
    maxim = -1
    topic = -1
    for el in pred:        
        if el[1] > maxim:
            maxim = el[1]
            topic = el[0]
    sentence.append(new_df['text'].iloc[i])
    score.append(maxim)
    topics.append(topic)
    if topic == 2:
        counter += 1
        print(new_df['text'].iloc[i], '\n \n ------------------ \n \n')
        #print("Sentence: {}\t \n Score: {}\t \n \n Topic: {}\t \n \n".format(new_df['text'].iloc[i], maxim, topic))
print(counter)

Senior Architect - Permanent/Remote anywhere in Canada (Vancity based) - Golang, SaaS/PaaS/IaaS, DevSecOps, scale and build new features for visionary product - kbarr@teemagroup.com #softwaredesign #Golang #softwarearchitect #remotejobs #backendengineer #Azure #AWS #gcpcloud 
 
 ------------------ 
 

See a request in a forest of logs.

Quickly and easily combine multiple log groups into a unified view and filter by request ID to see a single request across multiple AWS lambdas and services.

https://t.co/aqdmMHbrEO

#serverless #aws #lambda #148 
 
 ------------------ 
 

Senior Architect - Permanent/Remote anywhere in Canada (Vancity based) - Golang, SaaS/PaaS/IaaS, DevSecOps, scale and build new features for visionary product - kbarr@teemagroup.com #softwaredesign #Golang #softwarearchitect #remotejobs #backendengineer #Azure #AWS #gcpcloud 
 
 ------------------ 
 

I will be at the #NABshow. Would love to connect, share notes/trends on the world of #streaming and the latest with @

In [107]:
topics = []
score = []
sentence = []
counter = 0
for i in range(len(bow_corpus_test)):
    # Topic distribution for tweet i: a list of (topic_id, probability) pairs
    pred = lda_model_tfidf[bow_corpus_test[i]]
    # Keep the most probable topic and its probability (-1, -1 if pred is empty)
    topic, maxim = max(pred, key=lambda el: el[1], default=(-1, -1))
    sentence.append(new_df['text'].iloc[i])
    score.append(maxim)
    topics.append(topic)
    # Print only the tweets whose dominant topic is 3
    if topic == 3:
        counter += 1
        print(new_df['text'].iloc[i], '\n \n ------------------ \n \n')
print(counter)  # number of sampled tweets assigned to topic 3

AWS CMO Rachel Thornton on the future of customer-obsessed marketing in 2022 https://t.co/AYxZxuAJNz via @VentureBeat  #AWS #B2B 
 
 ------------------ 
 

#amazonwebservices has a many beautiful #pins and #stickers ! Strongly suggest you attend their #Summits ! See more at https://t.co/V8eLfEJmGw

e.g. #awssummit2022  #London April 27, 2022, then AWS Summit #Madrid May 04 - May 05, 2022, then #Korea, #Stock…https://t.co/UV6hulu91N 
 
 ------------------ 
 

@TammyCloudLover Oh trust me that’s never happening. I’m in the lab currently #aws 
 
 ------------------ 
 

💭 #AWS #Aurora Serverless V2 now has public access (V1 only allowed access from within a VPC). Now you can have serverless &amp; fully-managed benefits (cost savings + no fear of db size limits... i think) while doing something like creating a ton of dbs per #git branch

#database https://t.co/8WoXs90gO2 
 
 ------------------ 
 

It’s test time #aws as soon as I’m 100% I’m getting to that bag 
 
 ------------------ 
 

#Se

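We perform the same analysis for Azure: filter the North America (NOAM) tweets, build the dictionary and TF-IDF corpus, fit a four-topic LDA model, and inspect sample tweets for each topic.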
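
Each inspection cell in this analysis keeps, for every tweet, the topic with the highest probability. A minimal sketch of that selection step as a standalone helper is given here. The name `dominant_topic` and the sample prediction are hypothetical, and the input is assumed to be the list of `(topic_id, probability)` pairs obtained by indexing the LDA model with a document vector.

```python
def dominant_topic(pred):
    """Return (topic_id, probability) for the most probable topic.

    `pred` is assumed to be a list of (topic_id, probability) pairs.
    Returns (-1, -1) when the list is empty.
    """
    return max(pred, key=lambda el: el[1], default=(-1, -1))


# Hypothetical prediction for one tweet (not taken from the dataset)
sample_pred = [(0, 0.10), (1, 0.62), (2, 0.20), (3, 0.08)]
topic, prob = dominant_topic(sample_pred)
print(topic, prob)
```

On ties, `max` returns the first of the equally probable topics, which matches a loop that only updates on a strictly greater probability.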

In [108]:
# Azure-related tweets from the North America (NOAM) region
noam_region_azure_cloud = df_with_countries.loc[(df_with_countries['Region']=='NOAM') & (df_with_countries['is_azure']==1)]

In [109]:
noam_region_azure_cloud.shape

(9961, 19)

In [110]:
# All NOAM Azure tweets are used for training (no hold-out split)
noam_region_azure_cloud_train = noam_region_azure_cloud

In [111]:
# Tokenise each cleaned tweet; rows whose clean_text is not a string (e.g. NaN) are reported and skipped
corpus = []
for i, text in enumerate(noam_region_azure_cloud_train['clean_text']):
    try:
        corpus.append(text.split())
    except AttributeError as e:
        print(e)
        print(text, i)

# Map every token to an integer id
dct = corpora.Dictionary(corpus)

In [112]:
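# Drop tokens that appear in fewer than 15 tweets or in more than 50% of tweets; keep at most 100,000 tokens
dct.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)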

In [113]:
# Convert each tokenised tweet to a bag-of-words vector of (token_id, count) pairs
bow_corpus = [dct.doc2bow(doc) for doc in corpus]

In [114]:
len(bow_corpus)

9961

In [115]:
# Re-weight the bag-of-words counts with TF-IDF
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

In [116]:
# Fit a 4-topic LDA model on the TF-IDF-weighted corpus, then print the top terms of each topic
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=4, id2word=dct, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.015*"azure" + 0.010*"learn" + 0.010*"data" + 0.008*"team" + 0.007*"update" + 0.007*"service" + 0.007*"rt" + 0.007*"new" + 0.007*"got" + 0.006*"say"
Topic: 1 Word: 0.012*"troll" + 0.011*"azure" + 0.010*"learn" + 0.009*"one" + 0.008*"hear" + 0.008*"microsoft" + 0.008*"business" + 0.007*"would" + 0.007*"service" + 0.007*"never"
Topic: 2 Word: 0.013*"access" + 0.012*"like" + 0.010*"gifted" + 0.009*"run" + 0.009*"see" + 0.008*"get" + 0.008*"look" + 0.008*"vms" + 0.007*"azure" + 0.007*"includes"
Topic: 3 Word: 0.029*"troll" + 0.024*"verified" + 0.011*"cloud" + 0.009*"help" + 0.009*"one" + 0.008*"love" + 0.008*"azure" + 0.008*"today" + 0.006*"need" + 0.006*"fact"
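
Several of the top terms recur across topics (for example, "troll" leads both Topic 1 and Topic 3), which may indicate that the four topics are not cleanly separated. As a rough way to check this, the printed topic strings can be parsed into (word, weight) pairs and compared. The helper names `parse_topic` and `shared_terms` are hypothetical, and only the first three terms of each topic from the output above are included to keep the sketch short.

```python
import re
from collections import defaultdict

# First three terms of each Azure topic, copied from the printed output
topic_strings = {
    0: '0.015*"azure" + 0.010*"learn" + 0.010*"data"',
    1: '0.012*"troll" + 0.011*"azure" + 0.010*"learn"',
    2: '0.013*"access" + 0.012*"like" + 0.010*"gifted"',
    3: '0.029*"troll" + 0.024*"verified" + 0.011*"cloud"',
}

# Matches one `weight*"word"` term of a printed topic
TERM = re.compile(r'([\d.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]


def shared_terms(topics):
    """Map each word that appears in more than one topic to its topic ids."""
    seen = defaultdict(list)
    for topic_id, topic_string in topics.items():
        for word, _ in parse_topic(topic_string):
            seen[word].append(topic_id)
    return {word: ids for word, ids in seen.items() if len(ids) > 1}


print(shared_terms(topic_strings))
# {'azure': [0, 1], 'learn': [0, 1], 'troll': [1, 3]}
```

With the fitted model at hand, `lda_model_tfidf.show_topic(topic_id)` should give the (word, weight) pairs directly, so the string parsing is only needed when working from saved output.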


In [117]:
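# Inspect only the first 200 tweets (note: these are raw bag-of-words counts, whereas the model was fitted on TF-IDF weights)
bow_corpus_test = bow_corpus[:200]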

In [118]:
# Re-attach the original tweet text via tweetid so the samples can be read as posted
new_df = pd.merge(noam_region_azure_cloud, df[['text', 'tweetid']], how='left', on='tweetid')

In [119]:
topics = []
score = []
sentence = []
counter = 0
for i in range(len(bow_corpus_test)):
    # Topic distribution for tweet i: a list of (topic_id, probability) pairs
    pred = lda_model_tfidf[bow_corpus_test[i]]
    # Keep the most probable topic and its probability (-1, -1 if pred is empty)
    topic, maxim = max(pred, key=lambda el: el[1], default=(-1, -1))
    sentence.append(new_df['text'].iloc[i])
    score.append(maxim)
    topics.append(topic)
    # Print only the tweets whose dominant topic is 0
    if topic == 0:
        counter += 1
        print(new_df['text'].iloc[i], '\n \n ------------------ \n \n')
print(counter)  # number of sampled tweets assigned to topic 0

Passed my Azure Fundamentals exam this past Wednesday. Excited to go down this path. #cloudengineer #microsoftazure #azure https://t.co/LCzfSthMVX 
 
 ------------------ 
 

AlpsLogic IT Solutions has experience &amp; expertise in a broad range of Microsoft technologies and languages 👍 #USA #Canada #hustle #business #entrepreneur #motivation #remote #nomad #angular #Azure #SharePoint #Dotnet #MVC #Microsoft #Technologies https://t.co/Uhx1OM0aE8 
 
 ------------------ 
 

How Companies Are Leveraging Microsoft #Azure’s #IoT Hub. (IoT For All) #Cloud #IoT #IoTPL #IoTCL  #IoTPractioner #IoTCommunity @IoTcommunity @IoTchannel https://t.co/diIKatGIRU https://t.co/NbbXVD8QqD 
 
 ------------------ 
 

This week's Azure Infrastructure Update is up! (from IRONMAN Texas lol).

https://t.co/6gjN6dgV04

#azure 
 
 ------------------ 
 

Introduction to CosmosDB Security and Using #Azure #CosmosDB REST API with POSTMAN https://t.co/xJasWpUi8P 
 
 ------------------ 
 

How to draw a #spatial polyg

In [120]:
topics = []
score = []
sentence = []
counter = 0
for i in range(len(bow_corpus_test)):
    # Topic distribution for tweet i: a list of (topic_id, probability) pairs
    pred = lda_model_tfidf[bow_corpus_test[i]]
    # Keep the most probable topic and its probability (-1, -1 if pred is empty)
    topic, maxim = max(pred, key=lambda el: el[1], default=(-1, -1))
    sentence.append(new_df['text'].iloc[i])
    score.append(maxim)
    topics.append(topic)
    # Print only the tweets whose dominant topic is 1
    if topic == 1:
        counter += 1
        print(new_df['text'].iloc[i], '\n \n ------------------ \n \n')
print(counter)  # number of sampled tweets assigned to topic 1

What's the best system for managing and monitoring #data? Comment if you think you know.     For 💡 insight, watch this video showing @Microsoft #Azure tools and processes i that tackle malicious #security threats in real-time. https://t.co/CcbxZ24E5E 
 
 ------------------ 
 

Did you know there are 2 different connectivity modes in #Azure #CosmosDB? https://t.co/elOEQ7kbJr 
 
 ------------------ 
 

Time is of the essence when it comes to protecting your clients’ critical Azure workloads, with hourly RPO out of the box, you can ensure complete protection. Learn more about Datto Continuity for Microsoft #Azure: https://t.co/PQAdHOelyv https://t.co/kOMGfRHr33 
 
 ------------------ 
 

Aufsite delivers custom cloud solutions for your business. Cloud computing has several advantages for businesses. Contact us now to learn more!
☎️ (732) 234-1020
📤 sales@aufsite.com
🌐 https://t.co/DXUZsvRXPz
#cloudcomputing #cloud #cloudservices #aws #azure #microsoft https://t.co/BUca0RTyiv 
 
 ---------

In [121]:
topics = []
score = []
sentence = []
counter = 0
for i in range(len(bow_corpus_test)):
    # Topic distribution for tweet i: a list of (topic_id, probability) pairs
    pred = lda_model_tfidf[bow_corpus_test[i]]
    # Keep the most probable topic and its probability (-1, -1 if pred is empty)
    topic, maxim = max(pred, key=lambda el: el[1], default=(-1, -1))
    sentence.append(new_df['text'].iloc[i])
    score.append(maxim)
    topics.append(topic)
    # Print only the tweets whose dominant topic is 2
    if topic == 2:
        counter += 1
        print(new_df['text'].iloc[i], '\n \n ------------------ \n \n')
print(counter)  # number of sampled tweets assigned to topic 2

Senior Architect - Permanent/Remote anywhere in Canada (Vancity based) - Golang, SaaS/PaaS/IaaS, DevSecOps, scale and build new features for visionary product - kbarr@teemagroup.com #softwaredesign #Golang #softwarearchitect #remotejobs #backendengineer #Azure #AWS #gcpcloud 
 
 ------------------ 
 

☀️ Don't miss our on getting free access to our platform - No credit card required! 
🎯 Hurry up, just a few days left! Refresh your tech skills and take your career to the next level.
👉 https://t.co/NgLLyMdvUN

#AWS #Azure #GCP #AWSCertified #AzureCertified https://t.co/YmaviWFIgG 
 
 ------------------ 
 

Azure Arc + automation: step-by-step guides, code samples, and more…what are you waiting for?

#azure #automation #hybridcloud #multicloud  #developer #AzureArc https://t.co/UeuaWrzBAr 
 
 ------------------ 
 

Looking forward to collaborating with like minded people. We are #software solutions provider. 😀 #USA #Canada #hustle #business #entrepreneur #motivation #remote #nomad #angula

In [122]:
topics = []
score = []
sentence = []
counter = 0
for i in range(len(bow_corpus_test)):
    # Topic distribution for tweet i: a list of (topic_id, probability) pairs
    pred = lda_model_tfidf[bow_corpus_test[i]]
    # Keep the most probable topic and its probability (-1, -1 if pred is empty)
    topic, maxim = max(pred, key=lambda el: el[1], default=(-1, -1))
    sentence.append(new_df['text'].iloc[i])
    score.append(maxim)
    topics.append(topic)
    # Print only the tweets whose dominant topic is 3
    if topic == 3:
        counter += 1
        print(new_df['text'].iloc[i], '\n \n ------------------ \n \n')
print(counter)  # number of sampled tweets assigned to topic 3

Transforming and Importing to #CosmosDB data by using #Azure CosmosDB Data Migration Tool. https://t.co/qO1bzd6qn0 
 
 ------------------ 
 

Who else needs help?
#Python #Roblox #IoT #IIoT #Azure #PyTorch #Cython #RStats #DotNet #CPP #Java #BTSV #ADA #CSharp #Flutter #SQL #TensorFlow #JavaScript #ReactJS #Serverless #Linux #Security #NFT #opensource
#AI #CSS #WordPress #HTML #DevOps 
 
 ------------------ 
 

Check it out!  #Water #Cloud #Sky #Weddingdress #Peopleonbeach #Dress #Bride #Blue #Peopleinnature #Azure https://t.co/PCjuXp7oiD 
 
 ------------------ 
 

Microsoft Azure AZ-305 Exam Preparation Guide with tips and resources is here to help you pass this certification -
https://t.co/k3TXxfPp8C
#cloud #certification #technology #tothecloud #microsoft #azure #azurearrchitect #cloudcomputing  https://t.co/k3TXxfPp8C 
 
 ------------------ 
 

Who else needs help?
#Python #Roblox #IoT #IIoT #Azure #PyTorch #Cython #RStats #DotNet #CPP #Java #BTSV #ADA #CSharp #Flutter #SQL #Tenso

Finally, we perform topic modelling on the Google Cloud tweets from the same region.
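
The per-topic inspection in this notebook is run once for each topic id, recomputing the same predictions every time. A possible alternative, sketched here under assumptions, groups the tweets by dominant topic in a single pass. The function name `group_by_dominant_topic` and the sample data are hypothetical; each prediction is assumed to be a list of `(topic_id, probability)` pairs.

```python
from collections import defaultdict


def group_by_dominant_topic(predictions, texts):
    """Group texts by their most probable topic in one pass.

    `predictions[i]` is assumed to be a list of (topic_id, probability)
    pairs for `texts[i]`. Texts with an empty prediction go under -1.
    """
    groups = defaultdict(list)
    for pred, text in zip(predictions, texts):
        topic, prob = max(pred, key=lambda el: el[1], default=(-1, -1))
        groups[topic].append((text, prob))
    return dict(groups)


# Hypothetical predictions and texts (not taken from the dataset)
predictions = [
    [(0, 0.7), (1, 0.3)],
    [(0, 0.2), (1, 0.8)],
    [(0, 0.6), (1, 0.4)],
]
texts = ["tweet a", "tweet b", "tweet c"]

for topic, members in sorted(group_by_dominant_topic(predictions, texts).items()):
    print(topic, len(members), [text for text, _ in members])
```

The resulting dictionary holds both the per-topic counts (`len(members)`) and the sample tweets, so one pass can stand in for the separate cell per topic.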

In [123]:
# Google Cloud-related tweets from the North America (NOAM) region
noam_region_google_cloud = df_with_countries.loc[(df_with_countries['Region']=='NOAM') & (df_with_countries['is_gcp']==1)]

In [124]:
noam_region_google_cloud.shape

(2381, 19)

In [125]:
noam_region_google_cloud_train = noam_region_google_cloud

In [126]:
# Tokenise each cleaned tweet; rows whose clean_text is not a string (e.g. NaN) are reported and skipped
corpus = []
for i, text in enumerate(noam_region_google_cloud_train['clean_text']):
    try:
        corpus.append(text.split())
    except AttributeError as e:
        print(e)
        print(text, i)

# Map every token to an integer id
dct = corpora.Dictionary(corpus)

In [127]:
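# Drop tokens that appear in fewer than 15 tweets or in more than 50% of tweets; keep at most 100,000 tokens
dct.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)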

In [128]:
bow_corpus = [dct.doc2bow(doc) for doc in corpus]

In [129]:
len(bow_corpus)

2381

In [130]:
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

In [131]:
# Fit a 4-topic LDA model on the TF-IDF-weighted corpus, then print the top terms of each topic
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=4, id2word=dct, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.016*"ai" + 0.014*"search" + 0.013*"cc" + 0.012*"app" + 0.012*"sentient" + 0.011*"digital" + 0.011*"sale" + 0.011*"get" + 0.010*"apple" + 0.010*"u"
Topic: 1 Word: 0.028*"cloud" + 0.026*"consciousness" + 0.025*"buy" + 0.019*"account" + 0.019*"ai" + 0.017*"voice" + 0.015*"engineer" + 0.015*"sentient" + 0.015*"created" + 0.013*"number"
Topic: 2 Word: 0.018*"machine" + 0.016*"intelligence" + 0.015*"artificial" + 0.015*"data" + 0.014*"learning" + 0.014*"cloud" + 0.013*"image" + 0.011*"make" + 0.011*"market" + 0.010*"know"
Topic: 3 Word: 0.023*"new" + 0.018*"cloud" + 0.012*"ai" + 0.012*"one" + 0.011*"use" + 0.010*"engineer" + 0.010*"language" + 0.010*"platform" + 0.009*"technology" + 0.009*"read"


In [132]:
# Re-attach the original tweet text via tweetid so the samples can be read as posted
new_df = pd.merge(noam_region_google_cloud_train, df[['text', 'tweetid']], how='left', on='tweetid')

In [133]:
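# Inspect only the first 200 tweets (note: these are raw bag-of-words counts, whereas the model was fitted on TF-IDF weights)
bow_corpus_test = bow_corpus[:200]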

In [134]:
topics = []
score = []
sentence = []
counter = 0
for i in range(len(bow_corpus_test)):
    # Topic distribution for tweet i: a list of (topic_id, probability) pairs
    pred = lda_model_tfidf[bow_corpus_test[i]]
    # Keep the most probable topic and its probability (-1, -1 if pred is empty)
    topic, maxim = max(pred, key=lambda el: el[1], default=(-1, -1))
    sentence.append(new_df['text'].iloc[i])
    score.append(maxim)
    topics.append(topic)
    # Print only the tweets whose dominant topic is 0
    if topic == 0:
        counter += 1
        print(new_df['text'].iloc[i], '\n \n ------------------ \n \n')
print(counter)  # number of sampled tweets assigned to topic 0

Tutorial on Creating a Topical Internal Link Graph With Python: https://t.co/WMyN525eB4 
#seo #google #content #contentmarketing #searchengineoptimization #search 
#python #data #DataScience https://t.co/mEUmiXGtof 
 
 ------------------ 
 

Artificial Intelligence has become an integral part of our life from asking SIRI to turn on the alarm to asking Alexa to control our electronics. Education is one of those sectors where AI can play a huge role in transforming it!

#educhecked #checkedit #ai #education 
@GoogleAI https://t.co/PFwKYfmJLZ 
 
 ------------------ 
 

25 Open Source Projects by Google: https://t.co/4iC83pgJXM

PVPC #Arduino #programacion #HOLAMELONMAS #JuackeandoFinetwork05 #TierraAmarga26Abr #Ramen26A #Iot #IfYouWant #Options 
 
 ------------------ 
 

Struggling to get Google Cloud Digital Leader certified?

Prepare and pass your certification using my practice tests.

Get 40% off on my 5-star rated Google Cloud Digital Leader practice tests course on #udemy
#GoogleClo

In [135]:
topics = []
score = []
sentence = []
counter = 0
for i in range(len(bow_corpus_test)):
    # Topic distribution for tweet i: a list of (topic_id, probability) pairs
    pred = lda_model_tfidf[bow_corpus_test[i]]
    # Keep the most probable topic and its probability (-1, -1 if pred is empty)
    topic, maxim = max(pred, key=lambda el: el[1], default=(-1, -1))
    sentence.append(new_df['text'].iloc[i])
    score.append(maxim)
    topics.append(topic)
    # Print only the tweets whose dominant topic is 1
    if topic == 1:
        counter += 1
        print(new_df['text'].iloc[i], '\n \n ------------------ \n \n')
print(counter)  # number of sampled tweets assigned to topic 1

Architecting with Google Cloud #gcp #cloud #architecting #uber  https://t.co/7Y9iRxCIjP 
 
 ------------------ 
 

Cloud Titans Seek to Slash Carbon Impact: AWS, Meta, Google and Microsoft join 70 firms in iMasons Climate Accord to step up #datacenter carbon tracking.  @InfraMason #cloud #climate #sustainability https://t.co/B48ccyk5Qp 
 
 ------------------ 
 

Our 2022 Cloud MSP Vendor Map positions the top managed cloud service providers for #AWS, #Azure, and #GoogleCloud in an easy-to-understand Venn diagram.
https://t.co/cUNKL3ZqyR
#cloud #cloudcomputing #cloudmanagement 
 
 ------------------ 
 

Looking to consolidate tools in a single #cloudsecurity platform for #AWS, #Azure and #Google Cloud? Read this buyer's guide to select the Cloud Native Application Protection Platform (CNAPP) that works best for your business. #CISOs #infosec #cloud https://t.co/2dEpLZnZxb 
 
 ------------------ 
 

Pro tip: Bookmark this @googlecloud link to keep up with the latest and greatest tools, t

In [136]:
topics = []
score = []
sentence = []
counter = 0
for i in range(len(bow_corpus_test)):
    # Topic distribution for tweet i: a list of (topic_id, probability) pairs
    pred = lda_model_tfidf[bow_corpus_test[i]]
    # Keep the most probable topic and its probability (-1, -1 if pred is empty)
    topic, maxim = max(pred, key=lambda el: el[1], default=(-1, -1))
    sentence.append(new_df['text'].iloc[i])
    score.append(maxim)
    topics.append(topic)
    # Print only the tweets whose dominant topic is 2
    if topic == 2:
        counter += 1
        print(new_df['text'].iloc[i], '\n \n ------------------ \n \n')
print(counter)  # number of sampled tweets assigned to topic 2

More delays to DoD cloud contract; #AWS, #Azure, #Google, #Oracle in the running

Read more about the contract in our latest article: https://t.co/wl7qXFBNbm

#cloud @awscloud @GoogleCloudTech @Oracle @Azure  #managedhosting https://t.co/7mCnJwV96M 
 
 ------------------ 
 

Our effective AI and ML delivers pre-built machine learning models and machine learning enabled software that can be applied across multiple industries.

Read more 👉 https://t.co/xb2cY6VDug @Electrifai

#GoogleCloudPlatform #SmartSchedulingtailorProgramming #ResolveMissedCharges 
 
 ------------------ 
 

Quick reminder: Data and Visualization event tomorrow:  #data #visualization #googlecloudplatform #bigquery #looker https://t.co/llageQDXUG 
 
 ------------------ 
 

Read about how to prep for any disruption or outages that are sure to happen. Cloud-based and hybrid models (done right) will limit downtime and ensure business continuity. 

https://t.co/pHYEAAHglg
#cloudoutage #GoogleCloud #GCP #disasterrecovery #b

In [137]:
topics = []
score = []
sentence = []
counter = 0
for i in range(len(bow_corpus_test)):
    # Topic distribution for tweet i: a list of (topic_id, probability) pairs
    pred = lda_model_tfidf[bow_corpus_test[i]]
    # Keep the most probable topic and its probability (-1, -1 if pred is empty)
    topic, maxim = max(pred, key=lambda el: el[1], default=(-1, -1))
    sentence.append(new_df['text'].iloc[i])
    score.append(maxim)
    topics.append(topic)
    # Print only the tweets whose dominant topic is 3
    if topic == 3:
        counter += 1
        print(new_df['text'].iloc[i], '\n \n ------------------ \n \n')
print(counter)  # number of sampled tweets assigned to topic 3

Check it out! Rocky Linux is now on Google Cloud with Enterprise Support. If you are accustomed to CentOS or RHEL, have a look at this great new option. #google #cloud #googlecloudplatform #linux #centos #rhel https://t.co/Wt9QOugLTZ 
 
 ------------------ 
 

I forgot who said you are NOT really Certified till you post it here at LinkedIn : ) 
#googlecloud #googlecloudplatform #googlecloudcertified #datascience #dataanalytics #linkedin https://t.co/NVMIgEvrYA 
 
 ------------------ 
 

Google Pixel unleashed a game-changing virtual experience where you can create, play, share &amp; celebrate your 2022 NBA Playoffs skills! 🏀 https://t.co/Q7JRuphsa5
•••
#appbarry #makeitappen #atlanta #softwaredevelopment #appdevelopment #aws #NBA #google #pixel #googlepixelarena https://t.co/5BzoI9nqsj 
 
 ------------------ 
 

DoiT is a sponsor of #KubeCon2022! Our cloud experts are excited to help you grow your cloud knowledge at our space in person and virtually in Valencia, Spain May 18th - 20th. 