# Pandas Basics

Before we jump into data analysis, we will quickly cover some Pandas basics. Pandas is an external package that fits data into a tabular data structure called a "DataFrame", which lets us manipulate the data by rows and columns instead of iterating through two-dimensional lists using loops or list comprehensions. If you are using Anaconda, you do not need to install Pandas separately, as it is included in the distribution. While the course does not specifically focus on Pandas itself, you are strongly advised to get comfortable with it, since it is currently the industry standard and you are very likely to face it again in the future. It has far too many features to cover them all here, but the basics are shown below. You can read [the documentation](https://pandas.pydata.org/docs/) for more information.

## Contents

* [Creating, importing, exporting, and displaying tabular data](#data)
* [Using rows](#rows)
* [Using columns](#columns)
* [Using cells](#cells)
* [Manipulation](#manipulation)
    * [Adding a new column](#new-column)
    * [Applying a function to a column](#map)
    * [Concatenation](#concatenation)
    * [Filtering](#filtering)
    * [Aggregation](#aggregation)
    * [Join](#join)
* [More information](#more)

## Creating, importing, exporting, and displaying tabular data <a class="anchor" id="data"></a>

A common way to create a DataFrame object from scratch is using a dictionary, but [it is not the only way to do it](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [1]:
import pandas as pd

our_dict = {"column_1": ["foo", "bar"], "column_2": [1, 2]}
dataset = pd.DataFrame(our_dict)

# We can use IPython's display() function to display the whole dataset (the middle 
# gets truncated by default if it is too long):
display(dataset)

# Well, we actually do not even need to explicitly call display(), this works too:
# dataset

Unnamed: 0,column_1,column_2
0,foo,1
1,bar,2
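
As a minimal sketch of the "not the only way" remark, the same two-row table can also be built from a list of row dictionaries or from a list of lists. The variable names (`from_rows`, `from_lists`, `from_dict`) are only illustrative:

```python
import pandas as pd

# The dictionary-of-columns form used in the cell above.
from_dict = pd.DataFrame({"column_1": ["foo", "bar"], "column_2": [1, 2]})

# A list of row dictionaries: each dictionary becomes one row.
from_rows = pd.DataFrame([
    {"column_1": "foo", "column_2": 1},
    {"column_1": "bar", "column_2": 2},
])

# A list of lists, with the column names passed separately.
from_lists = pd.DataFrame(
    [["foo", 1], ["bar", 2]],
    columns=["column_1", "column_2"],
)

# All three constructions hold the same values.
print(from_rows.to_dict("list") == from_dict.to_dict("list"))
print(from_lists.to_dict("list") == from_dict.to_dict("list"))
```

Which form is more convenient usually depends on how the raw data arrives: column-wise (dictionary of lists) or record-wise (list of dictionaries).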


Often, you will need to import a dataset. We will now import a dataset derived from [a public Kaggle dataset](https://www.kaggle.com/gpreda/covid19-tweets) and use it for the rest of the tutorial:

In [2]:
# We can import a CSV file using Pandas:
dataset = pd.read_csv("covid19_tweets_filtered.csv")

# There are so many parameters we can use while importing a CSV file. Visit 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html 
# to see them all.

# We can use head() to quickly look at the structure and the first few rows:
dataset.head()

# We can define the number of rows we want to see by providing an "n" parameter.
# This would retrieve the first 10 rows:
# dataset.head(n=10)

# This would retrieve all rows except the last two:
# dataset.head(n=-2)

# The opposite of head(), tail() would retrieve the last few rows:
# dataset.tail()

# The shape attribute of a DataFrame object returns the number of rows and number 
# of columns in this order:
# dataset.shape

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,Prathamesh Bendre,,"A poet, reiki practitioner and a student of law.",2015-04-25 08:15:41,25,29,18,False,2020-07-25 12:26:59,Praying for good health and recovery of @Chouh...,"['covid19', 'covidPositive']",Twitter for Android,False
1,ChennaiCityNow,,Individual tweeting about significant happenin...,2009-04-26 09:38:11,3987,53,749,False,2020-07-25 12:26:44,July 25 #COVID19 update\r\n#TamilNadu - 6988\r...,"['COVID19', 'TamilNadu', 'chennai']",Twitter for iPhone,False
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False
3,Florian Bieber,Graz,"Niko i ništa, professor, so-called Balkan expe...",2009-06-18 09:46:10,18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,['COVID19'],Twitter for Android,False
4,Ms Paz,United States,,2019-09-15 18:10:09,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,"['FEMA', 'PuertoRico', 'COVID19']",Twitter for iPhone,False
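
To see how `head()`, `tail()`, and `shape` behave without loading the CSV file, here is a small sketch on a made-up ten-row table (`toy` is an illustrative stand-in, not the tweet dataset):

```python
import pandas as pd

# Ten rows with a single column holding the values 0..9.
toy = pd.DataFrame({"value": range(10)})

print(toy.shape)                          # (rows, columns)
print(toy.head()["value"].tolist())       # first 5 rows by default
print(toy.head(3)["value"].tolist())      # first 3 rows
print(toy.head(-2)["value"].tolist())     # everything except the last 2 rows
print(toy.tail(3)["value"].tolist())      # last 3 rows
```

A negative `n` counts from the opposite end, so `head(-2)` drops the last two rows rather than returning two rows.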


To export our DataFrame object, we can use the `to_csv()` method:

In [3]:
# dataset.to_csv('dataset.csv', index=False)
# We can remove index=False or set it to True in order to also export the row
# labels.
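
The effect of the `index` parameter is easier to see on a tiny table. The sketch below writes to in-memory buffers (`io.StringIO`) instead of files so that it leaves nothing on disk. The table and variable names are illustrative:

```python
import io
import pandas as pd

toy = pd.DataFrame({"column_1": ["foo", "bar"], "column_2": [1, 2]})

# Export without the row labels.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

# Export with the row labels (index=True is the default).
with_index = io.StringIO()
toy.to_csv(with_index)

print(without_index.getvalue())
print(with_index.getvalue())

# Reading the second version back without further options turns the exported
# row labels into an ordinary column with an auto-generated name.
with_index.seek(0)
reloaded = pd.read_csv(with_index)
print(reloaded.columns.tolist())
```

This is one reason `index=False` is often preferred when the row labels are just the automatically assigned 0, 1, 2, and so on.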

## Using rows <a class="anchor" id="rows"></a>

We can retrieve a specific row using `iloc[<row_index>]`:

In [4]:
# Retrieves the third row (since indices start from 0):
dataset.iloc[2]

user_name                                        marc goovaerts🇪🇺🏳️‍🌈
user_location                                                Brussels
user_description    Progressive mind. Flemish. Into movies, politi...
user_created                                      2009-06-13 13:48:16
user_followers                                                    283
user_friends                                                     1432
user_favourites                                                  1546
user_verified                                                   False
date                                              2020-07-25 12:26:44
text                Second wave of #COVID19 in Flanders..back to m...
hashtags                                      ['COVID19', 'homework']
source                                            Twitter for Android
is_retweet                                                      False
Name: 2, dtype: object

However, retrieving a single row or column gives us a Series object. A Series is like a one-dimensional version of a DataFrame (similar to a single row or column vector taken from a matrix). Depending on what we want to achieve, this may not be what we need. To force a single row to be a DataFrame object, we can use double brackets, like `iloc[[<row_index>]]` or `loc[[<row_label>]]`, which keeps the result two-dimensional.

In [5]:
# Retrieves the third row as DataFrame:
dataset.iloc[[2]]

# This is the long form (we will come to this later):
# dataset.iloc[[2],:]

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False
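
The difference between single and double brackets can be checked directly by looking at the type and shape of the result. This is a small sketch on a made-up table (`toy` is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"name": ["a", "b", "c"], "score": [1, 2, 3]})

single = toy.iloc[2]     # single brackets: the row comes back as a Series
wrapped = toy.iloc[[2]]  # double brackets: a one-row DataFrame

print(type(single).__name__, single.shape)
print(type(wrapped).__name__, wrapped.shape)
```

The Series has one dimension (one entry per column), while the DataFrame keeps both dimensions even though it contains a single row.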


We can also retrieve row ranges using `iloc[<row_index_start>:<row_index_end>]` or specific rows using `iloc[[<row_index_1>,<row_index_2>,...]]`:

In [6]:
# Retrieves the rows 3-5 (their indices start from 0):
dataset.iloc[2:5]

# This would retrieve all rows starting from row 3:
# dataset.iloc[2:]

# This can be used to retrieve non-consecutive rows:
# dataset.iloc[[1,3,5]]

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False
3,Florian Bieber,Graz,"Niko i ništa, professor, so-called Balkan expe...",2009-06-18 09:46:10,18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,['COVID19'],Twitter for Android,False
4,Ms Paz,United States,,2019-09-15 18:10:09,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,"['FEMA', 'PuertoRico', 'COVID19']",Twitter for iPhone,False


Giving a range also forces it to return a DataFrame:

In [7]:
# This is a single row, but it is a DataFrame although we did not use double 
# brackets:
dataset.iloc[2:3]

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False


Our dataset had no row labels, and we did not specify row names while importing it. So, Pandas automatically assigned unique row labels (0, 1, 2, ...) that happen to look like row indices. The values we see at the start of each row are these labels, not positional indices.

We can also rename each row. Check how the labels of the first two rows change:

In [8]:
# "0" becomes "first" and "1" becomes "second":
dataset.rename(index={0: "first", 1: "second"}).head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
first,Prathamesh Bendre,,"A poet, reiki practitioner and a student of law.",2015-04-25 08:15:41,25,29,18,False,2020-07-25 12:26:59,Praying for good health and recovery of @Chouh...,"['covid19', 'covidPositive']",Twitter for Android,False
second,ChennaiCityNow,,Individual tweeting about significant happenin...,2009-04-26 09:38:11,3987,53,749,False,2020-07-25 12:26:44,July 25 #COVID19 update\r\n#TamilNadu - 6988\r...,"['COVID19', 'TamilNadu', 'chennai']",Twitter for iPhone,False
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False
3,Florian Bieber,Graz,"Niko i ništa, professor, so-called Balkan expe...",2009-06-18 09:46:10,18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,['COVID19'],Twitter for Android,False
4,Ms Paz,United States,,2019-09-15 18:10:09,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,"['FEMA', 'PuertoRico', 'COVID19']",Twitter for iPhone,False


If we wanted to make the row name change permanent, we would write `dataset = dataset.rename(index={0: "first", 1: "second"})` to re-assign this new version to the variable. We could also tell Pandas to modify the existing object instead of returning a changed copy. To do so, we can use `inplace=True` as a parameter. So, `dataset.rename(index={0: "first", 1: "second"}, inplace=True)` has the same effect. The `inplace` parameter is accepted by many Pandas methods.
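
The two styles can be compared side by side on a small made-up table. The names `toy`, `before`, `renamed`, and `result` are illustrative:

```python
import pandas as pd

toy = pd.DataFrame({"name": ["a", "b", "c"]})

# Without inplace, rename() returns a new DataFrame and leaves toy untouched.
renamed = toy.rename(index={0: "first"})
before = toy.index.tolist()
print(before)
print(renamed.index.tolist())

# With inplace=True, toy itself is modified and the method returns None.
result = toy.rename(index={0: "first"}, inplace=True)
print(result)
print(toy.index.tolist())
```

Because the in-place call returns `None`, writing `toy = toy.rename(..., inplace=True)` would replace the table with `None`, so the two styles should not be mixed.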

Let us return to the labels. Sometimes we may prefer to locate rows using their labels instead of their indices. In that case, we can use `loc[]`. The same rules about getting a Series apply here as well.

In [9]:
# After changing the row labels, we can refer to the first row as "first" and 
# retrieve it as a DataFrame object:
dataset.rename(index={0: "first"}).loc[["first"]]

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
first,Prathamesh Bendre,,"A poet, reiki practitioner and a student of law.",2015-04-25 08:15:41,25,29,18,False,2020-07-25 12:26:59,Praying for good health and recovery of @Chouh...,"['covid19', 'covidPositive']",Twitter for Android,False


Note that if we did not change the row's name, we could also simply use `dataset.loc[[0]]`, because "0" behaves like a label rather than an index there. Apart from that coincidence, we cannot use indices with `loc[]` or labels with `iloc[]`.

We can remove specific rows using `drop()`:

In [10]:
# This removes the second and the fourth rows from the dataset:
dataset.drop([1, 3])

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,Prathamesh Bendre,,"A poet, reiki practitioner and a student of law.",2015-04-25 08:15:41,25,29,18,False,2020-07-25 12:26:59,Praying for good health and recovery of @Chouh...,"['covid19', 'covidPositive']",Twitter for Android,False
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False
4,Ms Paz,United States,,2019-09-15 18:10:09,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,"['FEMA', 'PuertoRico', 'COVID19']",Twitter for iPhone,False
5,WASH FOR HEALTH,,Tweets (by WHO team) in support of #washinhcf ...,2015-08-20 08:53:15,1657,823,2008,False,2020-07-25 12:26:02,Actionables for a healthy recovery from #COVID...,"['COVID19', 'climate']",Twitter for iPhone,False
6,N I C TA S H ♍,"Port Elizabeth, South Africa",I.G: @nictash_tash,2016-04-28 09:05:56,481,308,1120,False,2020-07-25 12:25:50,Volume for those at the back please. 🔊 #COVID1...,['COVID19'],Twitter for Android,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
40731,rainmaker,,,2017-08-16 22:12:17,709,1158,95006,False,2020-08-29 19:44:35,@politvidchannel Just not the lives of #COVID1...,"['COVID19', 'coronavirus', 'TrumpVirus']",Twitter for Android,False
40732,John Geer,,#StayAtHome #StayAtHomeSaveLifes #MaskUp \r\nF...,2020-04-18 01:55:14,61,168,10817,False,2020-08-29 19:44:34,Report #COVID19 outbreaks in K-12 schools here...,"['COVID19', 'CloseTheSchools', 'KeepTheSchools...",Twitter Web App,False
40733,Pris,T.O.,"A/V/L Techie, camera op. but twitter has becom...",2008-12-31 16:16:12,251,160,627,False,2020-08-29 19:44:23,"we have reached 25mil cases of #covid19, world...",['covid19'],Twitter Web App,False
40734,Jason,Ontario,When your cat has more baking soda than Ninja ...,2011-12-21 04:41:30,150,182,7295,False,2020-08-29 19:44:16,2020! The year of insanity! Lol! #COVID19 http...,['COVID19'],Twitter for Android,False
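
Notice that the remaining rows in the output keep their original labels after `drop()`. From that point on, labels and positions no longer coincide, which makes the difference between `loc[]` and `iloc[]` visible. Here is a minimal sketch on a made-up table (`toy` and `trimmed` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({"name": ["a", "b", "c", "d"]})

# Dropping the first row leaves the labels 1, 2, 3 on the remaining rows.
trimmed = toy.drop([0])
print(trimmed.index.tolist())

print(trimmed.loc[1, "name"])  # label 1 is the row holding "b"
print(trimmed.iloc[1, 0])      # position 1 is the row holding "c"

# Label 0 no longer exists, so loc[] cannot find it.
try:
    trimmed.loc[0]
except KeyError:
    print("label 0 is gone")
```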


## Using columns <a class="anchor" id="columns"></a>

It is possible to manipulate DataFrame columns as well.

Retrieving columns is similar. The only difference is now we need to use `iloc[:,<column_index>]` or `loc[:,<column_name>]`. Otherwise, the column index/name is confused with the row index/name.

In [11]:
# Retrieves the first column and forces it into DataFrame (notice the second set 
# of brackets around the column index):
dataset.iloc[:,[0]]

Unnamed: 0,user_name
0,Prathamesh Bendre
1,ChennaiCityNow
2,marc goovaerts🇪🇺🏳️‍🌈
3,Florian Bieber
4,Ms Paz
...,...
40731,rainmaker
40732,John Geer
40733,Pris
40734,Jason


Again, we can use `loc[]` to use column labels instead, but we actually have a shortcut this time:

In [12]:
# Retrieves the first column using its name and forces it into DataFrame:
dataset[["user_name"]]
# The output is the same with this:
dataset.loc[:,["user_name"]]

# Using a single set of brackets retrieves our column as Series:
# dataset["user_name"]
# Another option here is to use <DataFrame>.<column_name>:
# dataset.user_name

# Retrieving the column as a list:
# dataset.user_name.tolist()

Unnamed: 0,user_name
0,Prathamesh Bendre
1,ChennaiCityNow
2,marc goovaerts🇪🇺🏳️‍🌈
3,Florian Bieber
4,Ms Paz
...,...
40731,rainmaker
40732,John Geer
40733,Pris
40734,Jason
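
One caveat about the `<DataFrame>.<column_name>` shortcut is that it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The sketch below uses a made-up table with a column deliberately named `count` to illustrate this:

```python
import pandas as pd

toy = pd.DataFrame({"user_name": ["a", "b"], "count": [1, 2]})

print(type(toy["user_name"]).__name__)    # single brackets give a Series
print(type(toy[["user_name"]]).__name__)  # double brackets give a DataFrame
print(toy.user_name.tolist())

# "count" is also a DataFrame method, so attribute access returns the method
# instead of the column. Bracket access is unambiguous.
print(callable(toy.count))
print(toy["count"].tolist())
```

For that reason, bracket access is the safer default, and attribute access is best kept for quick exploration.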


We can take a subset of the columns using their names.

In [13]:
dataset[["text", "date", "user_name"]]

# We can also use iloc[] similar to the way we used it to retrieve specific rows.
# This would retrieve the same thing:
# dataset.iloc[:,[9,8,0]]

Unnamed: 0,text,date,user_name
0,Praying for good health and recovery of @Chouh...,2020-07-25 12:26:59,Prathamesh Bendre
1,July 25 #COVID19 update\r\n#TamilNadu - 6988\r...,2020-07-25 12:26:44,ChennaiCityNow
2,Second wave of #COVID19 in Flanders..back to m...,2020-07-25 12:26:44,marc goovaerts🇪🇺🏳️‍🌈
3,Holy water in times of #COVID19 https://t.co/Y...,2020-07-25 12:26:28,Florian Bieber
4,#FEMA acknowledges #PuertoRico lacks rebuilt h...,2020-07-25 12:26:21,Ms Paz
...,...,...,...
40731,@politvidchannel Just not the lives of #COVID1...,2020-08-29 19:44:35,rainmaker
40732,Report #COVID19 outbreaks in K-12 schools here...,2020-08-29 19:44:34,John Geer
40733,"we have reached 25mil cases of #covid19, world...",2020-08-29 19:44:23,Pris
40734,2020! The year of insanity! Lol! #COVID19 http...,2020-08-29 19:44:16,Jason


What if we have 100 columns and we need to retrieve 98 of them? Sometimes excluding certain columns is more practical. To do so, we can use `drop()` again. As we saw, it also works with rows: methods such as `drop()` that can work with both rows and columns have an `axis` parameter that is set to `0` (rows) by default. To make them work with columns, we need to specify `axis=1`.

In [14]:
# Retrieves all columns except user_name and user_location:
dataset.drop(['user_name', 'user_location'], axis=1)

Unnamed: 0,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,"A poet, reiki practitioner and a student of law.",2015-04-25 08:15:41,25,29,18,False,2020-07-25 12:26:59,Praying for good health and recovery of @Chouh...,"['covid19', 'covidPositive']",Twitter for Android,False
1,Individual tweeting about significant happenin...,2009-04-26 09:38:11,3987,53,749,False,2020-07-25 12:26:44,July 25 #COVID19 update\r\n#TamilNadu - 6988\r...,"['COVID19', 'TamilNadu', 'chennai']",Twitter for iPhone,False
2,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False
3,"Niko i ništa, professor, so-called Balkan expe...",2009-06-18 09:46:10,18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,['COVID19'],Twitter for Android,False
4,,2019-09-15 18:10:09,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,"['FEMA', 'PuertoRico', 'COVID19']",Twitter for iPhone,False
...,...,...,...,...,...,...,...,...,...,...,...
40731,,2017-08-16 22:12:17,709,1158,95006,False,2020-08-29 19:44:35,@politvidchannel Just not the lives of #COVID1...,"['COVID19', 'coronavirus', 'TrumpVirus']",Twitter for Android,False
40732,#StayAtHome #StayAtHomeSaveLifes #MaskUp \r\nF...,2020-04-18 01:55:14,61,168,10817,False,2020-08-29 19:44:34,Report #COVID19 outbreaks in K-12 schools here...,"['COVID19', 'CloseTheSchools', 'KeepTheSchools...",Twitter Web App,False
40733,"A/V/L Techie, camera op. but twitter has becom...",2008-12-31 16:16:12,251,160,627,False,2020-08-29 19:44:23,"we have reached 25mil cases of #covid19, world...",['covid19'],Twitter Web App,False
40734,When your cat has more baking soda than Ninja ...,2011-12-21 04:41:30,150,182,7295,False,2020-08-29 19:44:16,2020! The year of insanity! Lol! #COVID19 http...,['COVID19'],Twitter for Android,False
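
As an alternative to `axis=1`, `drop()` also accepts a `columns` keyword, which some readers find easier to remember. Here is a short sketch on a made-up table (all names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

by_axis = toy.drop(["a", "b"], axis=1)     # positional labels plus axis
by_keyword = toy.drop(columns=["a", "b"])  # same operation, spelled out

print(by_axis.columns.tolist())
print(by_axis.equals(by_keyword))

# The original table is unchanged because drop() returns a new DataFrame.
print(toy.columns.tolist())
```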


We can rename columns too. To do so, we can use `rename()` with the `columns` parameter:

In [15]:
# Renames user_name column to "user"
dataset.rename(columns={"user_name": "user"}).head()

# We can combine renaming rows and columns together. This renames the first 
# column and the first row:
# dataset.rename(columns={"user_name": "user"}, index={0: "first"}).head()

Unnamed: 0,user,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,Prathamesh Bendre,,"A poet, reiki practitioner and a student of law.",2015-04-25 08:15:41,25,29,18,False,2020-07-25 12:26:59,Praying for good health and recovery of @Chouh...,"['covid19', 'covidPositive']",Twitter for Android,False
1,ChennaiCityNow,,Individual tweeting about significant happenin...,2009-04-26 09:38:11,3987,53,749,False,2020-07-25 12:26:44,July 25 #COVID19 update\r\n#TamilNadu - 6988\r...,"['COVID19', 'TamilNadu', 'chennai']",Twitter for iPhone,False
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False
3,Florian Bieber,Graz,"Niko i ništa, professor, so-called Balkan expe...",2009-06-18 09:46:10,18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,['COVID19'],Twitter for Android,False
4,Ms Paz,United States,,2019-09-15 18:10:09,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,"['FEMA', 'PuertoRico', 'COVID19']",Twitter for iPhone,False


## Using cells <a class="anchor" id="cells"></a>

Sometimes we do not want to deal with whole rows or columns. We can access specific cells using row/column indices or labels.

As we saw, `iloc[<row_index>,:]` retrieves a row while `iloc[:,<column_index>]` retrieves a column. `:` is used there to create [a slice](https://python-reference.readthedocs.io/en/latest/docs/brackets/slicing.html). Since the slice has neither a start nor an end, it simply retrieves every row or column. Naturally, specifying both the row and the column together (`iloc[<row_index>,<column_index>]`) retrieves a cell.

In [16]:
# This retrieves the third row's second column (since indices start from 0):
dataset.iloc[2,1]

'Brussels'

Therefore, we can also use index ranges to slice the dataset using `[<row_index_start>:<row_index_end>, <column_index_start>:<column_index_end>]` format or specific rows and columns using lists:

In [17]:
# This retrieves the first four columns from the third, the fourth, and the 
# fifth rows:
dataset.iloc[2:5,0:4]

# This would retrieve the same thing:
# dataset.iloc[[2,3,4],[0,1,2,3]]

Unnamed: 0,user_name,user_location,user_description,user_created
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16
3,Florian Bieber,Graz,"Niko i ništa, professor, so-called Balkan expe...",2009-06-18 09:46:10
4,Ms Paz,United States,,2019-09-15 18:10:09


In [18]:
# This retrieves the specified columns for the specified rows:
dataset.iloc[[0,2],[1,3]]

Unnamed: 0,user_location,user_created
0,,2015-04-25 08:15:41
2,Brussels,2009-06-13 13:48:16
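
One detail worth knowing when slicing is that position slices in `iloc[]` exclude their end point, like ordinary Python slices, whereas label slices in `loc[]` include both ends. The sketch below shows the difference on a made-up table whose labels are the default 0 to 3:

```python
import pandas as pd

toy = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]})

by_position = toy.iloc[1:3, 0].tolist()  # positions 1 and 2
by_label = toy.loc[1:3, "a"].tolist()    # labels 1, 2 and 3

print(by_position)
print(by_label)
```

So the same-looking `1:3` returns two rows with `iloc[]` and three rows with `loc[]`.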


Sometimes, using labels can be easier. If we want to access the user_name column of the first row, we can use this:

In [19]:
dataset.loc[0,"user_name"]

# An alternative is to firstly get the column and then get the row from the 
# retrieved Series object:
# dataset["user_name"][0]

'Prathamesh Bendre'

Note that we cannot use indices with `loc[]` as it only accepts labels. The reason "0" works there is that our dataset does not have row names itself, so Pandas assigns unique row IDs as labels when we import it. "0" behaves like a label instead of an index there. If we change the first row's label from "0" to "first", `loc[]` would only work like this:

In [20]:
dataset.rename(index={0:"first"}).loc["first","user_name"]
# rename() can also be used to rename columns by passing a "columns"
# dictionary. We chain multiple methods for brevity here. It first renames
# the first row's label, then retrieves user_name of the first row.

# This is what the alternative looks like when we change the row name:
# dataset.rename(index={0:"first"})["user_name"]["first"]

'Prathamesh Bendre'

What if we want to use labels and indices together to retrieve cells? There used to be an indexer for this (`ix[]`), but it was deprecated and later removed, so we now need to combine `loc[]` with `iloc[]`. Let us rename the first row again, get the first row using its label, and then use an index to get the corresponding column. The upside is that this prevents confusion, because labels and indices are never mixed within a single indexer.

In [21]:
dataset.rename(index={0:"first"}).loc[["first",]].iloc[:,0]
# Since we are retrieving the first row as a DataFrame object by using two sets
# of brackets, we can change their order:
# dataset.rename(index={0:"first"}).iloc[:,0].loc[["first",]]

# It would also be possible to retrieve the row as a Series and use iloc[0] to
# get the first column from it by position (a plain [0] on a Series with string
# labels is deprecated in recent Pandas versions):
# Using a single value in loc[] and iloc[] means we only specify the row.
dataset.rename(index={0:"first"}).loc["first"].iloc[0]

'Prathamesh Bendre'
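
Another way to mix a row label with a column position, without chaining two indexers, is to translate one kind of reference into the other first. The sketch below is one possible approach on a made-up table. It relies on `columns[<position>]` to look up a column label and on `index.get_loc(<label>)` to look up a row position:

```python
import pandas as pd

toy = pd.DataFrame({"user_name": ["a", "b"], "followers": [10, 20]})
toy = toy.rename(index={0: "first"})

# Option 1: turn the column position into its label, then use loc[].
by_label = toy.loc["first", toy.columns[0]]

# Option 2: turn the row label into its position, then use iloc[].
row_position = toy.index.get_loc("first")
by_position = toy.iloc[row_position, 0]

print(by_label, by_position)
```

Both options access the cell in a single indexing step, which also makes them suitable for assigning a new value to that cell.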

Unfortunately, Pandas is not as intuitive as [R](https://www.r-project.org/) when it comes to data exploration.

## Manipulation <a class="anchor" id="manipulation"></a>

There are many operations you can perform on a DataFrame object, but these are the ones you will probably use most.

### Adding a new column <a class="anchor" id="new-column"></a>

To add a new column, we can assign the column values to a non-existing column of the existing dataset like this:

In [22]:
# Basically creates a unique ID for each row by simply getting the row count 
# using len(dataset):
dataset["new_column"] = range(0,len(dataset))

# assign() can be used as well for this:
# dataset = dataset.assign(new_column = range(0,len(dataset)))

display(dataset)

# Removing new_column column in place after displaying it:
dataset.drop("new_column", axis=1, inplace=True)

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet,new_column
0,Prathamesh Bendre,,"A poet, reiki practitioner and a student of law.",2015-04-25 08:15:41,25,29,18,False,2020-07-25 12:26:59,Praying for good health and recovery of @Chouh...,"['covid19', 'covidPositive']",Twitter for Android,False,0
1,ChennaiCityNow,,Individual tweeting about significant happenin...,2009-04-26 09:38:11,3987,53,749,False,2020-07-25 12:26:44,July 25 #COVID19 update\r\n#TamilNadu - 6988\r...,"['COVID19', 'TamilNadu', 'chennai']",Twitter for iPhone,False,1
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False,2
3,Florian Bieber,Graz,"Niko i ništa, professor, so-called Balkan expe...",2009-06-18 09:46:10,18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,['COVID19'],Twitter for Android,False,3
4,Ms Paz,United States,,2019-09-15 18:10:09,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,"['FEMA', 'PuertoRico', 'COVID19']",Twitter for iPhone,False,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40731,rainmaker,,,2017-08-16 22:12:17,709,1158,95006,False,2020-08-29 19:44:35,@politvidchannel Just not the lives of #COVID1...,"['COVID19', 'coronavirus', 'TrumpVirus']",Twitter for Android,False,40731
40732,John Geer,,#StayAtHome #StayAtHomeSaveLifes #MaskUp \r\nF...,2020-04-18 01:55:14,61,168,10817,False,2020-08-29 19:44:34,Report #COVID19 outbreaks in K-12 schools here...,"['COVID19', 'CloseTheSchools', 'KeepTheSchools...",Twitter Web App,False,40732
40733,Pris,T.O.,"A/V/L Techie, camera op. but twitter has becom...",2008-12-31 16:16:12,251,160,627,False,2020-08-29 19:44:23,"we have reached 25mil cases of #covid19, world...",['covid19'],Twitter Web App,False,40733
40734,Jason,Ontario,When your cat has more baking soda than Ninja ...,2011-12-21 04:41:30,150,182,7295,False,2020-08-29 19:44:16,2020! The year of insanity! Lol! #COVID19 http...,['COVID19'],Twitter for Android,False,40734


Sometimes we may want to create new columns based on other columns. For example, we can calculate the account age in days at the time each tweet was published by subtracting the "user_created" column from the "date" column and storing the result in an "account_age_days" column. Note that we first need to convert the dates from strings to the datetime format. To do so, we can use the `to_datetime()` function.

In [23]:
dataset["account_age_days"] = (pd.to_datetime(dataset.date) - pd.to_datetime(dataset.user_created)).dt.days
# After making the subtraction, we can access the values in days using 
# ".dt.days"

display(dataset[["user_name", "user_created", "date", "account_age_days"]])

# Removing account_age_days column in place after displaying it
dataset.drop("account_age_days", axis=1, inplace=True)

Unnamed: 0,user_name,user_created,date,account_age_days
0,Prathamesh Bendre,2015-04-25 08:15:41,2020-07-25 12:26:59,1918
1,ChennaiCityNow,2009-04-26 09:38:11,2020-07-25 12:26:44,4108
2,marc goovaerts🇪🇺🏳️‍🌈,2009-06-13 13:48:16,2020-07-25 12:26:44,4059
3,Florian Bieber,2009-06-18 09:46:10,2020-07-25 12:26:28,4055
4,Ms Paz,2019-09-15 18:10:09,2020-07-25 12:26:21,313
...,...,...,...,...
40731,rainmaker,2017-08-16 22:12:17,2020-08-29 19:44:35,1108
40732,John Geer,2020-04-18 01:55:14,2020-08-29 19:44:34,133
40733,Pris,2008-12-31 16:16:12,2020-08-29 19:44:23,4259
40734,Jason,2011-12-21 04:41:30,2020-08-29 19:44:16,3174
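
It helps to know that `.dt.days` keeps only the whole-day part of each time difference. The following sketch uses two made-up rows to show this, together with `.dt.total_seconds()` as one way to get fractional days when they matter:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_created": ["2020-01-01 00:00:00", "2020-01-01 12:00:00"],
    "date": ["2020-01-03 06:00:00", "2020-01-02 00:00:00"],
})

# Subtracting two datetime columns gives a column of time differences.
age = pd.to_datetime(toy["date"]) - pd.to_datetime(toy["user_created"])

whole_days = age.dt.days.tolist()
fractional_days = (age.dt.total_seconds() / 86400).tolist()

print(whole_days)       # 2 days 6 hours -> 2, and 12 hours -> 0
print(fractional_days)  # 2.25 and 0.5
```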


### Applying a function to a column <a class="anchor" id="map"></a>

To quickly map all values of a specific column to new values, we can use the `map()` method, which applies a function to every value in that column:

In [24]:
# We can also define a lambda function and use it like this. This makes all text 
# values lowercase:
text_lowercase = dataset["text"].map(lambda x: x.lower())
# We could assign this back to dataset["text"] to actually update the dataset.
# Note that if we want to apply a function to a DataFrame object instead of a 
# Series object, we should use apply() instead.

text_lowercase

0        praying for good health and recovery of @chouh...
1        july 25 #covid19 update\r\n#tamilnadu - 6988\r...
2        second wave of #covid19 in flanders..back to m...
3        holy water in times of #covid19 https://t.co/y...
4        #fema acknowledges #puertorico lacks rebuilt h...
                               ...                        
40731    @politvidchannel just not the lives of #covid1...
40732    report #covid19 outbreaks in k-12 schools here...
40733    we have reached 25mil cases of #covid19, world...
40734    2020! the year of insanity! lol! #covid19 http...
40735    more than 1,200 students test positive for #co...
Name: text, Length: 40736, dtype: object
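
To make the distinction between `map()` and `apply()` concrete, here is a small sketch on a made-up two-column table. It also shows the `.str` accessor, which offers built-in string methods such as `lower()` as an alternative to a lambda. All names are illustrative:

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["Hello #COVID19", "Stay SAFE"],
    "source": ["Web", "iPhone"],
})

# Series.map() calls the function once per value.
mapped = toy["text"].map(lambda x: x.lower())

# The .str accessor gives the same result for this column.
vectorised = toy["text"].str.lower()
print(mapped.tolist() == vectorised.tolist())

# DataFrame.apply() with axis=1 passes each row to the function, so the
# function can combine several columns.
combined = toy.apply(lambda row: f"{row['source']}: {row['text']}", axis=1)
print(combined.tolist())
```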

### Concatenation <a class="anchor" id="concatenation"></a>

We can concatenate two DataFrame objects by their rows:

In [25]:
# Since we are concatenating the dataset with itself along the rows, we get the
# same columns with every row appearing twice:
pd.concat([dataset, dataset])

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,Prathamesh Bendre,,"A poet, reiki practitioner and a student of law.",2015-04-25 08:15:41,25,29,18,False,2020-07-25 12:26:59,Praying for good health and recovery of @Chouh...,"['covid19', 'covidPositive']",Twitter for Android,False
1,ChennaiCityNow,,Individual tweeting about significant happenin...,2009-04-26 09:38:11,3987,53,749,False,2020-07-25 12:26:44,July 25 #COVID19 update\r\n#TamilNadu - 6988\r...,"['COVID19', 'TamilNadu', 'chennai']",Twitter for iPhone,False
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False
3,Florian Bieber,Graz,"Niko i ništa, professor, so-called Balkan expe...",2009-06-18 09:46:10,18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,['COVID19'],Twitter for Android,False
4,Ms Paz,United States,,2019-09-15 18:10:09,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,"['FEMA', 'PuertoRico', 'COVID19']",Twitter for iPhone,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
40731,rainmaker,,,2017-08-16 22:12:17,709,1158,95006,False,2020-08-29 19:44:35,@politvidchannel Just not the lives of #COVID1...,"['COVID19', 'coronavirus', 'TrumpVirus']",Twitter for Android,False
40732,John Geer,,#StayAtHome #StayAtHomeSaveLifes #MaskUp \r\nF...,2020-04-18 01:55:14,61,168,10817,False,2020-08-29 19:44:34,Report #COVID19 outbreaks in K-12 schools here...,"['COVID19', 'CloseTheSchools', 'KeepTheSchools...",Twitter Web App,False
40733,Pris,T.O.,"A/V/L Techie, camera op. but twitter has becom...",2008-12-31 16:16:12,251,160,627,False,2020-08-29 19:44:23,"we have reached 25mil cases of #covid19, world...",['covid19'],Twitter Web App,False
40734,Jason,Ontario,When your cat has more baking soda than Ninja ...,2011-12-21 04:41:30,150,182,7295,False,2020-08-29 19:44:16,2020! The year of insanity! Lol! #COVID19 http...,['COVID19'],Twitter for Android,False
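
Notice that the row labels in the output above still end at 40734 even though the number of rows has doubled: `pd.concat()` keeps the original index labels, so each label now appears twice. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in for our dataset) to show this, and how `ignore_index=True` renumbers the rows instead:

```python
import pandas as pd

# Hypothetical toy data standing in for the tweet dataset.
toy = pd.DataFrame({"user_name": ["a", "b"], "user_followers": [10, 200]})

# Default behaviour: the original index labels are kept, so 0 and 1 repeat.
stacked = pd.concat([toy, toy])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True discards the old labels and numbers the rows from 0.
renumbered = pd.concat([toy, toy], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Renumbering is usually what we want when the old row labels carry no meaning of their own.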


We can also concatenate two DataFrame objects by their columns using `axis=1`:

In [26]:
# Concatenating the dataset with itself by columns keeps the same rows
# and doubles the number of columns:
pd.concat([dataset, dataset], axis=1)

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,...,user_created.1,user_followers.1,user_friends.1,user_favourites.1,user_verified.1,date.1,text.1,hashtags,source,is_retweet
0,Prathamesh Bendre,,"A poet, reiki practitioner and a student of law.",2015-04-25 08:15:41,25,29,18,False,2020-07-25 12:26:59,Praying for good health and recovery of @Chouh...,...,2015-04-25 08:15:41,25,29,18,False,2020-07-25 12:26:59,Praying for good health and recovery of @Chouh...,"['covid19', 'covidPositive']",Twitter for Android,False
1,ChennaiCityNow,,Individual tweeting about significant happenin...,2009-04-26 09:38:11,3987,53,749,False,2020-07-25 12:26:44,July 25 #COVID19 update\r\n#TamilNadu - 6988\r...,...,2009-04-26 09:38:11,3987,53,749,False,2020-07-25 12:26:44,July 25 #COVID19 update\r\n#TamilNadu - 6988\r...,"['COVID19', 'TamilNadu', 'chennai']",Twitter for iPhone,False
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,...,2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False
3,Florian Bieber,Graz,"Niko i ništa, professor, so-called Balkan expe...",2009-06-18 09:46:10,18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,...,2009-06-18 09:46:10,18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,['COVID19'],Twitter for Android,False
4,Ms Paz,United States,,2019-09-15 18:10:09,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,...,2019-09-15 18:10:09,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,"['FEMA', 'PuertoRico', 'COVID19']",Twitter for iPhone,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40731,rainmaker,,,2017-08-16 22:12:17,709,1158,95006,False,2020-08-29 19:44:35,@politvidchannel Just not the lives of #COVID1...,...,2017-08-16 22:12:17,709,1158,95006,False,2020-08-29 19:44:35,@politvidchannel Just not the lives of #COVID1...,"['COVID19', 'coronavirus', 'TrumpVirus']",Twitter for Android,False
40732,John Geer,,#StayAtHome #StayAtHomeSaveLifes #MaskUp \r\nF...,2020-04-18 01:55:14,61,168,10817,False,2020-08-29 19:44:34,Report #COVID19 outbreaks in K-12 schools here...,...,2020-04-18 01:55:14,61,168,10817,False,2020-08-29 19:44:34,Report #COVID19 outbreaks in K-12 schools here...,"['COVID19', 'CloseTheSchools', 'KeepTheSchools...",Twitter Web App,False
40733,Pris,T.O.,"A/V/L Techie, camera op. but twitter has becom...",2008-12-31 16:16:12,251,160,627,False,2020-08-29 19:44:23,"we have reached 25mil cases of #covid19, world...",...,2008-12-31 16:16:12,251,160,627,False,2020-08-29 19:44:23,"we have reached 25mil cases of #covid19, world...",['covid19'],Twitter Web App,False
40734,Jason,Ontario,When your cat has more baking soda than Ninja ...,2011-12-21 04:41:30,150,182,7295,False,2020-08-29 19:44:16,2020! The year of insanity! Lol! #COVID19 http...,...,2011-12-21 04:41:30,150,182,7295,False,2020-08-29 19:44:16,2020! The year of insanity! Lol! #COVID19 http...,['COVID19'],Twitter for Android,False
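
When concatenating by columns, rows are matched by their index labels rather than by their position. The following is a minimal sketch with two made-up DataFrames (`left` and `right` are hypothetical names) whose indices only partly overlap; labels that exist in only one of them are filled with missing values:

```python
import pandas as pd

# Two hypothetical frames that share only the index label 1.
left = pd.DataFrame({"user_name": ["a", "b"]}, index=[0, 1])
right = pd.DataFrame({"user_followers": [200, 300]}, index=[1, 2])

# Rows are aligned on their index labels; by default every label is kept.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

Only the row labelled 1 has a value in both columns; the other two rows contain a missing value.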


### Filtering <a class="anchor" id="filtering"></a>

We can easily remove rows that have empty cells (listwise deletion) using `dropna()`:

In [27]:
# Retrieves rows that do not have any missing values in any of the columns:
dataset.dropna()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False
3,Florian Bieber,Graz,"Niko i ništa, professor, so-called Balkan expe...",2009-06-18 09:46:10,18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,['COVID19'],Twitter for Android,False
6,N I C TA S H ♍,"Port Elizabeth, South Africa",I.G: @nictash_tash,2016-04-28 09:05:56,481,308,1120,False,2020-07-25 12:25:50,Volume for those at the back please. 🔊 #COVID1...,['COVID19'],Twitter for Android,False
8,CARLINO,"New Orleans, LA",IG Carlino213: Alumni LA Southwest College: Pr...,2015-04-19 22:07:53,11,3,133,False,2020-07-25 12:25:29,Crazy that the world has come to this but as A...,['covid19'],Twitter for iPhone,False
10,ᏉᎥ☻լꂅϮ,astroworld,wednesday addams as a disney princess keepin i...,2017-05-26 05:46:42,624,950,18775,False,2020-07-25 12:25:24,I miss isopropyl alcohol so much!!!! Ethanol i...,['COVID19'],Twitter for iPhone,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
40729,ABS-CBN News,"Manila, Philippines","Stories, video, and multimedia for Filipinos w...",2008-08-16 10:09:33,7062635,1069,1169,True,2020-08-29 19:45:00,DOH warns public vs disclosing coronavirus pat...,['COVID19'],TweetDeck,False
40730,Norwalk Library CT,"Norwalk, CT",Norwalk Public Library is a public library loc...,2010-03-11 20:03:25,3782,1912,2526,False,2020-08-29 19:44:48,What are those #library #cats doing now? #COV...,"['library', 'cats', 'COVID19', 'pandemic', 'co...",Twitter Web App,False
40733,Pris,T.O.,"A/V/L Techie, camera op. but twitter has becom...",2008-12-31 16:16:12,251,160,627,False,2020-08-29 19:44:23,"we have reached 25mil cases of #covid19, world...",['covid19'],Twitter Web App,False
40734,Jason,Ontario,When your cat has more baking soda than Ninja ...,2011-12-21 04:41:30,150,182,7295,False,2020-08-29 19:44:16,2020! The year of insanity! Lol! #COVID19 http...,['COVID19'],Twitter for Android,False
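
Listwise deletion can discard many rows. If we only care about missing values in certain columns, `dropna()` accepts a `subset` argument. A small sketch on made-up data (`toy` is a hypothetical stand-in for our dataset):

```python
import pandas as pd

# Hypothetical toy data with some missing values (None).
toy = pd.DataFrame({
    "user_name": ["a", "b", "c"],
    "user_location": ["Graz", None, "Ontario"],
    "user_description": [None, None, "hello"],
})

# Listwise deletion: only the row without any missing value survives.
complete_rows = toy.dropna()
print(complete_rows.index.tolist())   # [2]

# Only check user_location: a missing user_description is tolerated.
with_location = toy.dropna(subset=["user_location"])
print(with_location.index.tolist())   # [0, 2]
```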


Using `axis=1`, `dropna()` can also be used to remove columns that have missing values:

In [28]:
# Retrieves columns that do not have any missing values in any of the rows:
dataset.dropna(axis=1)

Unnamed: 0,user_name,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,is_retweet
0,Prathamesh Bendre,2015-04-25 08:15:41,25,29,18,False,2020-07-25 12:26:59,Praying for good health and recovery of @Chouh...,"['covid19', 'covidPositive']",False
1,ChennaiCityNow,2009-04-26 09:38:11,3987,53,749,False,2020-07-25 12:26:44,July 25 #COVID19 update\r\n#TamilNadu - 6988\r...,"['COVID19', 'TamilNadu', 'chennai']",False
2,marc goovaerts🇪🇺🏳️‍🌈,2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",False
3,Florian Bieber,2009-06-18 09:46:10,18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,['COVID19'],False
4,Ms Paz,2019-09-15 18:10:09,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,"['FEMA', 'PuertoRico', 'COVID19']",False
...,...,...,...,...,...,...,...,...,...,...
40731,rainmaker,2017-08-16 22:12:17,709,1158,95006,False,2020-08-29 19:44:35,@politvidchannel Just not the lives of #COVID1...,"['COVID19', 'coronavirus', 'TrumpVirus']",False
40732,John Geer,2020-04-18 01:55:14,61,168,10817,False,2020-08-29 19:44:34,Report #COVID19 outbreaks in K-12 schools here...,"['COVID19', 'CloseTheSchools', 'KeepTheSchools...",False
40733,Pris,2008-12-31 16:16:12,251,160,627,False,2020-08-29 19:44:23,"we have reached 25mil cases of #covid19, world...",['covid19'],False
40734,Jason,2011-12-21 04:41:30,150,182,7295,False,2020-08-29 19:44:16,2020! The year of insanity! Lol! #COVID19 http...,['COVID19'],False
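
Before dropping whole columns, it can help to check how many values are actually missing in each of them. One way to do this, sketched here on made-up data (`toy` is a hypothetical stand-in for our dataset), is to combine `isna()` with `sum()`:

```python
import pandas as pd

# Hypothetical toy data with some missing values (None).
toy = pd.DataFrame({
    "user_name": ["a", "b", "c"],
    "user_location": ["Graz", None, "Ontario"],
    "source": ["Twitter Web App", "TweetDeck", None],
})

# isna() marks missing cells with True; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Only columns without any missing value are kept.
complete_columns = toy.dropna(axis=1)
print(complete_columns.columns.tolist())  # ['user_name']
```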


Instead of providing specific row indices, we can also provide conditions. This one retrieves tweets that belong to users who have at least 100 followers:

In [29]:
dataset[dataset.user_followers >= 100]

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
1,ChennaiCityNow,,Individual tweeting about significant happenin...,2009-04-26 09:38:11,3987,53,749,False,2020-07-25 12:26:44,July 25 #COVID19 update\r\n#TamilNadu - 6988\r...,"['COVID19', 'TamilNadu', 'chennai']",Twitter for iPhone,False
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",2009-06-13 13:48:16,283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False
3,Florian Bieber,Graz,"Niko i ništa, professor, so-called Balkan expe...",2009-06-18 09:46:10,18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,['COVID19'],Twitter for Android,False
4,Ms Paz,United States,,2019-09-15 18:10:09,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,"['FEMA', 'PuertoRico', 'COVID19']",Twitter for iPhone,False
5,WASH FOR HEALTH,,Tweets (by WHO team) in support of #washinhcf ...,2015-08-20 08:53:15,1657,823,2008,False,2020-07-25 12:26:02,Actionables for a healthy recovery from #COVID...,"['COVID19', 'climate']",Twitter for iPhone,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
40730,Norwalk Library CT,"Norwalk, CT",Norwalk Public Library is a public library loc...,2010-03-11 20:03:25,3782,1912,2526,False,2020-08-29 19:44:48,What are those #library #cats doing now? #COV...,"['library', 'cats', 'COVID19', 'pandemic', 'co...",Twitter Web App,False
40731,rainmaker,,,2017-08-16 22:12:17,709,1158,95006,False,2020-08-29 19:44:35,@politvidchannel Just not the lives of #COVID1...,"['COVID19', 'coronavirus', 'TrumpVirus']",Twitter for Android,False
40733,Pris,T.O.,"A/V/L Techie, camera op. but twitter has becom...",2008-12-31 16:16:12,251,160,627,False,2020-08-29 19:44:23,"we have reached 25mil cases of #covid19, world...",['covid19'],Twitter Web App,False
40734,Jason,Ontario,When your cat has more baking soda than Ninja ...,2011-12-21 04:41:30,150,182,7295,False,2020-08-29 19:44:16,2020! The year of insanity! Lol! #COVID19 http...,['COVID19'],Twitter for Android,False
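
To see why this works, it helps to look at the condition on its own: it evaluates to a column of `True`/`False` values (one per row), and only the rows marked `True` are kept. A minimal sketch on made-up data (`toy` and `mask` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data standing in for the tweet dataset.
toy = pd.DataFrame({
    "user_name": ["a", "b", "c"],
    "user_followers": [25, 3987, 100],
})

# The condition alone produces one boolean per row.
mask = toy["user_followers"] >= 100
print(mask.tolist())               # [False, True, True]

# Indexing with the mask keeps only the rows where it is True.
popular = toy[mask]
print(popular["user_name"].tolist())  # ['b', 'c']
```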


We can chain different operations together. For example, we can retrieve the tweets that are not retweets and that belong to users who have at least 100 followers. From that result, we can keep only the "user_name" and "user_followers" columns, and then sort the rows by follower count in descending order. The step-by-step chaining process is shown below:

In [30]:
# Tweets that are not retweets:
# dataset[(dataset.is_retweet == False)]

# Also requiring them to belong to users with at least 100 followers:
# dataset[(dataset.is_retweet == False) & (dataset.user_followers >= 100)]

# Retrieving the user_name and user_followers columns:
# dataset[(dataset.is_retweet == False) & (dataset.user_followers >= 100)][["user_name","user_followers"]]

# Sorting them:
dataset[(dataset.is_retweet == False) & (dataset.user_followers >= 100)][["user_name","user_followers"]].sort_values(by=['user_followers'], ascending=False)

Unnamed: 0,user_name,user_followers
1602,CGTN,13892841
45,CGTN,13892795
667,CGTN,13892793
1231,CGTN,13892792
5867,CGTN,13892212
...,...,...
26520,Parent Security,100
13334,sunfarms,100
26183,Parent Security,100
22187,AlwaysSlay,100


Note that one account may have multiple tweets in the dataset. It is also possible for them to have different follower counts at different times, so seeing the same user multiple times with differing follower counts is normal.
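
Long chains like the one above can be hard to read on a single line. As one possible alternative, the same steps can be wrapped in parentheses and spread over several lines, with `.loc` selecting rows and columns in one go. This sketch uses made-up data (`toy` is a hypothetical stand-in for our dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the tweet dataset.
toy = pd.DataFrame({
    "user_name": ["a", "b", "c", "d"],
    "user_followers": [50, 300, 100, 9000],
    "is_retweet": [False, False, True, False],
})

top_users = (
    toy.loc[
        # Row condition: not a retweet (~ negates) and at least 100 followers.
        ~toy["is_retweet"] & (toy["user_followers"] >= 100),
        # Columns to keep.
        ["user_name", "user_followers"],
    ]
    .sort_values(by="user_followers", ascending=False)
)
print(top_users)
```

Here, user "a" is dropped for having too few followers and user "c" for being a retweet, leaving "d" and "b" in that order.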

### Aggregation <a class="anchor" id="aggregation"></a>

We may need to aggregate the data to get some statistics. For example, we can count the number of tweets (rows) for each user (assuming users do not change their usernames) and then sort the counts:

In [31]:
dataset.groupby(by=["user_name"]).size().sort_values(ascending=False)

user_name
Coronavirus Updates       624
covidnews.ch              365
Covid Data                208
Hindustan Times           174
Paperbirds_Coronavirus    147
                         ... 
Pinklemonade306             1
Pinkvilla                   1
Pinkvilla Telly             1
Pinnacle Wealth Mgmt        1
!F                          1
Length: 26287, dtype: int64
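
Counting rows is only one kind of aggregation. With `agg()`, we can compute several statistics per group at once. The sketch below uses made-up data (`toy`, `tweet_count`, and `max_followers` are hypothetical names chosen for illustration):

```python
import pandas as pd

# Hypothetical toy data: one row per tweet.
toy = pd.DataFrame({
    "user_name": ["CGTN", "CGTN", "sunfarms"],
    "user_followers": [13892795, 13892841, 100],
})

# Each keyword becomes an output column: (input column, aggregation).
# "count" counts the non-missing values in the column for each group.
summary = toy.groupby("user_name").agg(
    tweet_count=("user_followers", "count"),
    max_followers=("user_followers", "max"),
)
print(summary)
```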

### Join <a class="anchor" id="join"></a>

Let us assume that our dataset is not one big table. For example, we may have separate datasets for users and tweets. If we need to combine them, we can join them through their key (in this case, user_name).

To do so, let us first split users and tweets into separate DataFrame objects:

In [32]:
users = dataset[["user_name", "user_location", "user_description", "user_followers", 
                 "user_friends", "user_favourites", "user_verified"]].copy()
tweets = dataset[["user_name", "date", "text", "hashtags", "source", "is_retweet"]]
# Notice that tweets and users both have user_name column.

Note that there are users who have multiple tweets in our dataset, so we need to remove the duplicates from users before we try to join them. However, since their follower counts or other data can differ between tweets, removing the duplicates is not straightforward. For demonstration purposes, we will simply use the "user_name" column, keep each user's first occurrence in the dataset, and remove the rest:

In [33]:
# This requires users to be an independent copy of the dataset (see above) in order 
# to not produce a warning:
users.drop_duplicates(subset="user_name", inplace=True)

You may have noticed that we used `copy()` when selecting the user columns from the dataset. When we select part of a DataFrame object, Pandas does not always create independent data; for efficiency, the result may still refer to the same data in the computer's memory as the whole dataset. If we then modify the selection (users) in place, Pandas produces a warning to draw our attention to the fact that the change may or may not affect the original (dataset). Using `copy()`, we create a totally independent DataFrame object. Note that warnings, unlike errors, do not prevent your code from running; they simply point to possible problems.
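
A minimal sketch of the same pattern on made-up data (`toy` is a hypothetical stand-in for our dataset) shows that modifying the independent copy leaves the original untouched:

```python
import pandas as pd

# Hypothetical toy data: user "a" has two tweets.
toy = pd.DataFrame({
    "user_name": ["a", "a", "b"],
    "user_followers": [1, 2, 3],
})

# copy() makes users independent of toy.
users = toy[["user_name", "user_followers"]].copy()

# Keeps the first occurrence of each user_name and removes the rest.
users.drop_duplicates(subset="user_name", inplace=True)

print(len(users))  # 2
print(len(toy))    # 3
```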

Now, we can (inner) join these separate datasets to see tweets with their users:

In [34]:
users_tweets = pd.merge(users, tweets, how="inner", on="user_name")

users_tweets.head()

Unnamed: 0,user_name,user_location,user_description,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,Prathamesh Bendre,,"A poet, reiki practitioner and a student of law.",25,29,18,False,2020-07-25 12:26:59,Praying for good health and recovery of @Chouh...,"['covid19', 'covidPositive']",Twitter for Android,False
1,ChennaiCityNow,,Individual tweeting about significant happenin...,3987,53,749,False,2020-07-25 12:26:44,July 25 #COVID19 update\r\n#TamilNadu - 6988\r...,"['COVID19', 'TamilNadu', 'chennai']",Twitter for iPhone,False
2,marc goovaerts🇪🇺🏳️‍🌈,Brussels,"Progressive mind. Flemish. Into movies, politi...",283,1432,1546,False,2020-07-25 12:26:44,Second wave of #COVID19 in Flanders..back to m...,"['COVID19', 'homework']",Twitter for Android,False
3,Florian Bieber,Graz,"Niko i ništa, professor, so-called Balkan expe...",18145,1389,13578,False,2020-07-25 12:26:28,Holy water in times of #COVID19 https://t.co/Y...,['COVID19'],Twitter for Android,False
4,Ms Paz,United States,,127,974,30217,False,2020-07-25 12:26:21,#FEMA acknowledges #PuertoRico lacks rebuilt h...,"['FEMA', 'PuertoRico', 'COVID19']",Twitter for iPhone,False


We can see that we now have a single table that is similar to the one we imported.
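
An inner join keeps only the keys that appear in both tables. In our case, every tweet has a matching user, so nothing is lost, but that is not always true. The sketch below uses made-up tables (the contents of `users` and `tweets` here are hypothetical) to compare an inner join with a left join, which keeps every row of the first table:

```python
import pandas as pd

# Hypothetical tables: user "b" has no tweets, and "c" is not in users.
users = pd.DataFrame({"user_name": ["a", "b"], "user_followers": [10, 20]})
tweets = pd.DataFrame({"user_name": ["a", "a", "c"], "text": ["t1", "t2", "t3"]})

# Inner join: only user "a" appears in both tables.
inner = pd.merge(users, tweets, how="inner", on="user_name")
print(len(inner))  # 2

# Left join: user "b" is kept, with a missing value in the text column.
left = pd.merge(users, tweets, how="left", on="user_name")
print(len(left))   # 3
```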

## More information <a class="anchor" id="more"></a>

There are many more things you can do with Pandas, and there are many resources out there. Since it is very popular, if you have a question, it is very likely that Stack Overflow already has the answer. Still, you can send me an e-mail if you have questions.

* [Pandas documentation](https://pandas.pydata.org/docs/)
* [Pandas community tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)
* [A recent and extensive Pandas video tutorial](https://youtu.be/PcvsOaixUh8)