# Exploratory Data Analysis

Exploratory data analysis (EDA) is the systematic examination of a data set to summarize its main characteristics, find patterns and relationships, and identify anomalies or irregularities. It is an iterative process that combines visualization, statistical techniques, and other tools to gain insights into the data. EDA is typically carried out at the beginning of a data analysis project, before more formal methods are applied, to build an understanding of the data and to surface important features or trends that are not immediately apparent. It helps researchers and data scientists form hypotheses and generate ideas for further analysis.

In [1]:
# Load the libraries
import pandas as pd
import plotly.express as px

In [2]:
# Load the data
df = pd.read_hdf('./../../code/data/starbucks/data.h5', key='starbucks')

In [4]:
# Get the shape of the data
df.shape

(769439, 28)

The data has 769439 rows and 28 columns, i.e. the data set contains information about 769439 tweets. Each scraped tweet comes with associated information such as the time at which it was posted, the identifier of the conversation it belongs to, the username of the user who posted it, and the number of likes and quotes it received.

In [5]:
# Get the number of duplicate tweets
df[df['tweet'].duplicated()].shape

(76652, 28)

There are 76652 observations whose tweet text duplicates that of an earlier observation. It is better to remove them, as duplicates do not provide any extra information. This duplication happens when one tweet ends up with multiple identifiers; for reasons that are unclear, Twitter can change the identifiers of tweets.
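
The removal step can be sketched as follows. This is a minimal illustration on a small made-up DataFrame (`toy_df` and its rows are assumptions, not part of the real data); it keeps the first occurrence of each tweet text and drops the later repeats.

```python
import pandas as pd

# Toy stand-in for the tweets data; the rows are made up for illustration.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "tweet": ["love my latte", "love my latte", "too expensive", "love my latte"],
})

# Keep the first occurrence of each tweet text and drop the later repeats,
# even though the repeats carry different identifiers.
deduped_df = toy_df.drop_duplicates(subset="tweet", keep="first").reset_index(drop=True)

print(deduped_df.shape)
```

Deduplicating on the `tweet` column rather than on `id` matches the check in the cell above, which also compares tweet text only.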

In [6]:
# Get the number of blank tweets
df['tweet'].isna().sum()

4

There are 4 observations with no tweet text at all. This can happen due to errors during the scraping of the data.
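
Such rows carry no text to analyze, so one option is to drop them. The sketch below shows this on a made-up DataFrame (`toy_df` and its rows are illustrative assumptions); whether the original analysis drops these rows at this point is not stated in the notebook.

```python
import pandas as pd

# Toy stand-in for the tweets data; the second row has no tweet text.
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "tweet": ["great coffee", None, "long queue today"],
})

# Drop the rows whose tweet text is missing.
non_blank_df = toy_df.dropna(subset=["tweet"]).reset_index(drop=True)

print(non_blank_df["tweet"].isna().sum())
```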

In [7]:
# Get the number of unique usernames
df['username'].nunique()

524472

There are 524472 unique usernames in the data, i.e. that many distinct accounts tweeted about Starbucks in the chosen period.

In [8]:
# Get the number of unique languages
df['language'].nunique()

65

The data contains tweets in 65 different languages. For our analysis, it is better to consider tweets posted in the English language only.
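
Restricting the data to English tweets could look like the sketch below. It assumes that the `language` column stores short language codes and that English is coded as `"en"`; both the code value and the toy rows are assumptions that should be checked against the actual values in the column.

```python
import pandas as pd

# Toy stand-in for the tweets data; the rows are made up for illustration.
toy_df = pd.DataFrame({
    "tweet": ["nice cup", "muy caro", "best frappuccino", "tres bon"],
    "language": ["en", "es", "en", "fr"],
})

# Assumption: English tweets carry the code "en" in the language column.
english_df = toy_df[toy_df["language"] == "en"].reset_index(drop=True)

print(english_df.shape)
```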

In [9]:
# Plot the number of tweets for the 10 most common languages
lang_df = pd.DataFrame(df['language'].value_counts()).reset_index().iloc[:10]
lang_df.columns = ['language', 'num_tweets']
fig = px.bar(lang_df, x='language', y='num_tweets', log_y=True, title="Distribution of languages in tweets")
fig.update_xaxes(title="Languages")
fig.update_yaxes(title="Number of tweets")
fig.show()

In [27]:
# Select the rows whose username contains the substring "starbucks"
starbucks_df = df.dropna(subset=['username'])
starbucks_df = starbucks_df[starbucks_df['username'].str.contains('starbucks')]

In [32]:
# Print the usernames containing the "starbucks" substring
print(set(starbucks_df['username'].values))

{'starbucks__01', 'starbuckslay', 'matchstarbucks', 'starbucksssftw', 'starbucks4649', 'kfcmcdstarbucks', 'SCTstarbucks', 'starbucksperu', 'starbucks_es', 'starbucksteve9', 'Dutastarbucks', 'starbucksfan32', 'starbuckshomeme', 'starbucks9515', 'starbucksgirl51', 'marocstarbucks', 'oownerstarbucks', 'investarbucks', 'annannstarbucks', 'starbucks_j_cpg', 'emo_starbucks', 'starbuckspoho', 'starbucksilivri', 'everystarbucks_', 'starbucks_1996', 'jemaatstarbucks', 'pinkstarbucks_', 'starbucksph', 'wearegstarbucks', 'starbucksfloozy', 'starbucks1940', 'starbucks_item', 'starbucks_cstm', 'jiyasstarbucks', 'starbuckslabor'}


Some of these accounts belong to the Starbucks company itself. Tweets from these accounts cannot be used to analyze people's perception of the brand. Hence, tweets from the official accounts of the company will be removed.
The following are the accounts whose tweets need to be removed from the data set:
1. starbucks_es - Official handle of Starbucks Spain
2. starbucks_cstm - Bot for Starbucks collectibles
3. starbucks_j_cpg - Starbucks official account for CPG in Japan
4. starbuckspoho - Official handle of Starbucks Port Hope
5. starbucksph - Official handle of Starbucks Philippines
6. starbuckshomeme - Official handle of Starbucks At Home Arabia
7. starbucksperu - Official handle of Starbucks Peru
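
One way to drop the tweets from the accounts listed above is a membership filter on the `username` column. The sketch below runs on a made-up DataFrame: `toy_df` and the username `coffee_lover` are hypothetical, while the handles in `official_accounts` are taken from the list above.

```python
import pandas as pd

# Handles from the list above whose tweets should be excluded.
official_accounts = [
    "starbucks_es", "starbucks_cstm", "starbucks_j_cpg", "starbuckspoho",
    "starbucksph", "starbuckshomeme", "starbucksperu",
]

# Toy stand-in for the tweets data; the rows are made up for illustration.
toy_df = pd.DataFrame({
    "username": ["starbucks_es", "starbucksfan32", "coffee_lover", "starbucksph"],
    "tweet": ["promo", "my usual order", "too sweet", "new store"],
})

# Keep only the rows whose username is not in the exclusion list.
public_df = toy_df[~toy_df["username"].isin(official_accounts)].reset_index(drop=True)

print(public_df["username"].tolist())
```

An exact-match filter with `isin` leaves accounts such as `starbucksfan32` untouched, which a substring match on "starbucks" would not.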

The descriptive statistics for the numeric columns (tweet identifier, number of quotes, number of replies, followers of the users, followings of the users, and number of favorites) are as follows:

In [11]:
df.describe()

| | id | quote_count | replies_count | followers_count | following_count | favourites_count |
|---|---|---|---|---|---|---|
| count | 769415.0 | 769414.0 | 769417.0 | 769417.0 | 769417.0 | 769417.0 |
| mean | 1.576728e+18 | 0.350595 | 0.730379 | 63389.31 | 1630.781 | 27702.6 |
| std | 7165636000000000.0 | 21.384164 | 14.810353 | 818153.2 | 9815.163 | 58334.23 |
| min | 1.152794e+18 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.570202e+18 | 0.0 | 0.0 | 87.0 | 153.0 | 1317.0 |
| 50% | 1.576381e+18 | 0.0 | 0.0 | 331.0 | 416.0 | 7984.0 |
| 75% | 1.582854e+18 | 0.0 | 1.0 | 1153.0 | 1067.0 | 28623.0 |
| max | 1.589007e+18 | 7320.0 | 5934.0 | 60444890.0 | 1655061.0 | 3133858.0 |
