credit to : [Pandas Tutorials Series](https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/) by data36

# Pandas Tutorial 1: Pandas Basics
- Reading Data Files
- DataFrames
- Data Selection

In [1]:
import pandas as pd

In [8]:
df_zoo = pd.read_csv("zoo_animals.csv", delimiter=",")
df_zoo

Unnamed: 0,animal,uniq_id,water_need
0,elephant,1001,500
1,elephant,1002,600
2,elephant,1003,550
3,tiger,1004,300
4,tiger,1005,320
5,tiger,1006,330
6,tiger,1007,290
7,tiger,1008,310
8,zebra,1009,200
9,zebra,1010,220


In [15]:
df_articles = pd.read_csv("pandas_tutorial_read.csv", delimiter=";")
df_articles

Unnamed: 0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
0,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
1,2018-01-01 00:04:01,read,country_7,2458151263,AdWords,Africa
2,2018-01-01 00:04:02,read,country_7,2458151264,AdWords,Europe
3,2018-01-01 00:05:03,read,country_8,2458151265,Reddit,North America
4,2018-01-01 00:05:42,read,country_6,2458151266,Reddit,North America
...,...,...,...,...,...,...
1789,2018-01-01 23:57:14,read,country_2,2458153051,AdWords,North America
1790,2018-01-01 23:58:33,read,country_8,2458153052,SEO,Asia
1791,2018-01-01 23:59:36,read,country_6,2458153053,Reddit,Asia
1792,2018-01-01 23:59:36,read,country_7,2458153054,AdWords,Europe


### Naming columns while reading files

In [16]:
df_articles = pd.read_csv("pandas_tutorial_read.csv", delimiter=";", names=["my_datetime", "event", "country", "user_id", "source", "topic"])
df_articles

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
2,2018-01-01 00:04:01,read,country_7,2458151263,AdWords,Africa
3,2018-01-01 00:04:02,read,country_7,2458151264,AdWords,Europe
4,2018-01-01 00:05:03,read,country_8,2458151265,Reddit,North America
...,...,...,...,...,...,...
1790,2018-01-01 23:57:14,read,country_2,2458153051,AdWords,North America
1791,2018-01-01 23:58:33,read,country_8,2458153052,SEO,Asia
1792,2018-01-01 23:59:36,read,country_6,2458153053,Reddit,Asia
1793,2018-01-01 23:59:36,read,country_7,2458153054,AdWords,Europe


### Get random Samples from dataset

In [17]:
df_articles.sample(5)

Unnamed: 0,my_datetime,event,country,user_id,source,topic
457,2018-01-01 06:09:59,read,country_6,2458151718,AdWords,Asia
477,2018-01-01 06:24:46,read,country_2,2458151738,Reddit,South America
440,2018-01-01 05:59:55,read,country_5,2458151701,Reddit,South America
389,2018-01-01 05:19:34,read,country_5,2458151650,Reddit,Asia
1429,2018-01-01 19:25:51,read,country_2,2458152690,SEO,North America


### Select specific columns of your dataframe
the outer bracket frames tell pandas that you want to select columns, and the inner brackets are for the list (remember? Python lists go between bracket frames) of the column names.

In [18]:
df_articles[["source","topic"]]

Unnamed: 0,source,topic
0,SEO,North America
1,SEO,South America
2,AdWords,Africa
3,AdWords,Europe
4,Reddit,North America
...,...,...
1790,AdWords,North America
1791,SEO,Asia
1792,Reddit,Asia
1793,AdWords,Europe


In [21]:
df_articles[["country","user_id"]]

Unnamed: 0,country,user_id
0,country_7,2458151261
1,country_7,2458151262
2,country_7,2458151263
3,country_7,2458151264
4,country_8,2458151265
...,...,...
1790,country_2,2458153051
1791,country_8,2458153052
1792,country_6,2458153053
1793,country_7,2458153054


## Select Pandas Series instead of Dataframe
Sometimes (especially in predictive analytics projects), you want to get Series objects instead of DataFrames. You can get a Series using any of these two syntaxes (and selecting only one column):
- article_read.user_id
- article_read['user_id']

In [24]:
df_articles.source

0           SEO
1           SEO
2       AdWords
3       AdWords
4        Reddit
         ...   
1790    AdWords
1791        SEO
1792     Reddit
1793    AdWords
1794     Reddit
Name: source, Length: 1795, dtype: object

In [25]:
df_articles['source']

0           SEO
1           SEO
2       AdWords
3       AdWords
4        Reddit
         ...   
1790    AdWords
1791        SEO
1792     Reddit
1793    AdWords
1794     Reddit
Name: source, Length: 1795, dtype: object

### Filter for specific values in your dataframe

Let’s say, you want to see a list of only the users who came from the ‘SEO’ source. In this case you have to filter for the ‘SEO’ value in the ‘source’ column:

In [26]:
df_articles[df_articles["source"] == "SEO"]

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
11,2018-01-01 00:08:57,read,country_7,2458151272,SEO,Australia
15,2018-01-01 00:11:22,read,country_7,2458151276,SEO,North America
16,2018-01-01 00:13:05,read,country_8,2458151277,SEO,North America
...,...,...,...,...,...,...
1772,2018-01-01 23:45:58,read,country_7,2458153033,SEO,South America
1777,2018-01-01 23:49:52,read,country_5,2458153038,SEO,North America
1779,2018-01-01 23:51:25,read,country_4,2458153040,SEO,South America
1784,2018-01-01 23:54:03,read,country_2,2458153045,SEO,North America


### Functions can be used after each other

In [28]:
df_articles.head()[["source","country"]]

Unnamed: 0,source,country
0,SEO,country_7
1,SEO,country_7
2,AdWords,country_7
3,AdWords,country_7
4,Reddit,country_8


In [29]:
df_articles[["source","country"]].head()

Unnamed: 0,source,country
0,SEO,country_7
1,SEO,country_7
2,AdWords,country_7
3,AdWords,country_7
4,Reddit,country_8


### Test yourself!
Select the user_id, the country and the topic columns for the users who are from country_2! Print the first five rows only!

In [36]:
df_articles[df_articles["country"] == "country_2"][["user_id", "country", "topic"]].head(5)

Unnamed: 0,user_id,country,topic
6,2458151267,country_2,Europe
13,2458151274,country_2,Europe
17,2458151278,country_2,Asia
19,2458151280,country_2,Asia
20,2458151281,country_2,Asia


----------------

# Pandas Tutorial 2: Aggregation and Grouping

- Let’s count the number of rows (the number of animals) in zoo!
- Let’s calculate the total water_need of the animals!
- Let’s find out which is the smallest water_need value!
- And then the greatest water_need value!
- And eventually the average water_need!

In [37]:
import pandas as pd

In [38]:
df_zoo = pd.read_csv("zoo_animals.csv", delimiter=",")
df_zoo

Unnamed: 0,animal,uniq_id,water_need
0,elephant,1001,500
1,elephant,1002,600
2,elephant,1003,550
3,tiger,1004,300
4,tiger,1005,320
5,tiger,1006,330
6,tiger,1007,290
7,tiger,1008,310
8,zebra,1009,200
9,zebra,1010,220


## Pandas Data Aggregation #1: .count()

### the number of rows (the number of animals) in zoo!

In [40]:
df_zoo.size

66

In [41]:
df.shape

(22, 3)

In [42]:
df.ndim

2

In [43]:
df_zoo.count()

animal        22
uniq_id       22
water_need    22
dtype: int64

In [51]:
total_animals = df_zoo["animal"].count()
print("Total Number of Animals in Zoo: ", total_animals)

toal_animals = df_zoo.animal.count()
print("Total Number of Animals in Zoo: ", total_animals)

Total Number of Animals in Zoo:  22
Total Number of Animals in Zoo:  22


## Pandas Data Aggregation #2: .sum()

###  the total water_need of the animals!

In [49]:
total_water_need = df_zoo["water_need"].sum()
print("Total Water Needs : ", total_water_need)

#alternative
total_water_need = df_zoo.water_need.sum()
print("Total Water Needs : ", total_water_need)

Total Water Needs :  7650
Total Water Needs :  7650


In [53]:
df_zoo.sum()
# as per result, you can see .sum() turns the words of the animal column into one string of animal names.

animal        elephantelephantelephanttigertigertigertigerti...
uniq_id                                                   22253
water_need                                                 7650
dtype: object

## Pandas Data Aggregation #3 and #4: .min() and .max()

### the smallest water_need value! & Most water need

In [54]:
smallest_water_need = df_zoo["water_need"].min()
print("Smallest Water need: ", smallest_water_need)

Smallest Water need:  80


In [55]:
most_water_need = df_zoo["water_need"].max()
print("Most Water Need: ", most_water_need)

Most Water Need:  600


## Pandas Data aggregation #5 and #6: .mean() and .median()

### the average water_need!

In [56]:
average_water_need = df_zoo["water_need"].mean()
print("Average Water Need: ", average_water_need)

Average Water Need:  347.72727272727275


### median water need

In [57]:
median_water_need = df_zoo["water_need"].median()
print("Median Water Need: ", median_water_need)

Median Water Need:  325.0


### Avareage water needs per Each Animal Group

In [66]:
# returns as Data Frame
df_avg_water_needs_by_animal_types = df_zoo.groupby("animal").mean()[["water_need"]]
df_avg_water_needs_by_animal_types

Unnamed: 0_level_0,water_need
animal,Unnamed: 1_level_1
elephant,550.0
kangaroo,416.666667
lion,477.5
tiger,310.0
zebra,184.285714


In [68]:
# returns as Series Object
ds_avg_water_needs_by_animal_types = df_zoo.groupby("animal").mean().water_need
ds_avg_water_needs_by_animal_types

animal
elephant    550.000000
kangaroo    416.666667
lion        477.500000
tiger       310.000000
zebra       184.285714
Name: water_need, dtype: float64

## Test yourself #1

In [72]:
df_articles.sample(5)

Unnamed: 0,my_datetime,event,country,user_id,source,topic
780,2018-01-01 10:33:30,read,country_2,2458152041,AdWords,Europe
862,2018-01-01 11:39:48,read,country_7,2458152123,Reddit,Australia
17,2018-01-01 00:13:06,read,country_2,2458151278,Reddit,Asia
744,2018-01-01 10:03:33,read,country_6,2458152005,AdWords,Europe
237,2018-01-01 03:17:35,read,country_7,2458151498,Reddit,Asia


### What’s the most frequent source in the article dataframe?

In [86]:
most_frequent_source = df_articles["source"].mode()
print("Most Frequent Source: ", most_frequent_source)

Most Frequent Source:  0    Reddit
dtype: object


In [92]:
df_articles.groupby("source").count()[["topic"]]

Unnamed: 0_level_0,topic
source,Unnamed: 1_level_1
AdWords,500
Reddit,949
SEO,346


## Test yourself #2

### For the users of country_2, what was the most frequent topic and source combination? Or in other words: which topic, from which source, brought the most views from country_2?

In [93]:
df_articles.head()

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
2,2018-01-01 00:04:01,read,country_7,2458151263,AdWords,Africa
3,2018-01-01 00:04:02,read,country_7,2458151264,AdWords,Europe
4,2018-01-01 00:05:03,read,country_8,2458151265,Reddit,North America


In [107]:
# after we aggregrate (group by), we need to do something like count, sum, min, etc LIKE IN SQL
df_articles[df_articles["country"] == "country_2"].groupby(["topic", "source"]).count()[["event"]]

Unnamed: 0_level_0,Unnamed: 1_level_0,event
topic,source,Unnamed: 2_level_1
Africa,AdWords,3
Africa,Reddit,24
Africa,SEO,7
Asia,AdWords,31
Asia,Reddit,139
Asia,SEO,9
Australia,AdWords,6
Australia,Reddit,18
Australia,SEO,10
Europe,AdWords,46


In [114]:
df_articles[df_articles["country"] == "country_2"].groupby(["topic", "source"]).count().max()

my_datetime    139
event          139
country        139
user_id        139
dtype: int64

--------------

# Pandas Tutorial 3: Important Data Formatting Methods 
- merge
- sort
- reset_index
- fillna

## Pandas Merge (a.k.a. “joining” dataframes)

it’s almost the same as SQL’s JOIN method.

In [120]:
zoo_eats = pd.DataFrame([['elephant','vegetables'], ['tiger','meat'], ['kangaroo','vegetables'], ['zebra','vegetables'], ['giraffe','vegetables']], columns=['animal', 'food'])

In [122]:
df_zoo_eats

Unnamed: 0,animal,food
0,elephant,vegetables
1,tiger,meat
2,kangaroo,vegetables
3,zebra,meat
4,giraffe,vegetables


In [123]:
df_zoo

Unnamed: 0,animal,uniq_id,water_need
0,elephant,1001,500
1,elephant,1002,600
2,elephant,1003,550
3,tiger,1004,300
4,tiger,1005,320
5,tiger,1006,330
6,tiger,1007,290
7,tiger,1008,310
8,zebra,1009,200
9,zebra,1010,220


We want to merge these two pandas dataframes into one big dataframe.

In [134]:
#pandas default merge is like INNER JOIN, so lions are gone in result
df_zoo.merge(df_zoo_eats)

Unnamed: 0,animal,uniq_id,water_need,food
0,elephant,1001,500,vegetables
1,elephant,1002,600,vegetables
2,elephant,1003,550,vegetables
3,tiger,1004,300,meat
4,tiger,1005,320,meat
5,tiger,1006,330,meat
6,tiger,1007,290,meat
7,tiger,1008,310,meat
8,zebra,1009,200,meat
9,zebra,1010,220,meat


In [131]:
# using outer join
# with outer join , lions are in the result set
df_zoo.merge(df_zoo_eats, how="outer")

Unnamed: 0,animal,uniq_id,water_need,food
0,elephant,1001.0,500.0,vegetables
1,elephant,1002.0,600.0,vegetables
2,elephant,1003.0,550.0,vegetables
3,tiger,1004.0,300.0,meat
4,tiger,1005.0,320.0,meat
5,tiger,1006.0,330.0,meat
6,tiger,1007.0,290.0,meat
7,tiger,1008.0,310.0,meat
8,zebra,1009.0,200.0,meat
9,zebra,1010.0,220.0,meat


As per above result, the giraffe line would be misleading and irrelevant since we don’t have any giraffes in our zoo anyway. 

In [133]:
df_zoo.merge(df_zoo_eats, how="left")

Unnamed: 0,animal,uniq_id,water_need,food
0,elephant,1001,500,vegetables
1,elephant,1002,600,vegetables
2,elephant,1003,550,vegetables
3,tiger,1004,300,meat
4,tiger,1005,320,meat
5,tiger,1006,330,meat
6,tiger,1007,290,meat
7,tiger,1008,310,meat
8,zebra,1009,200,meat
9,zebra,1010,220,meat


In [136]:
df_zoo.merge(df_zoo_eats, how="right")

Unnamed: 0,animal,uniq_id,water_need,food
0,elephant,1001.0,500.0,vegetables
1,elephant,1002.0,600.0,vegetables
2,elephant,1003.0,550.0,vegetables
3,tiger,1004.0,300.0,meat
4,tiger,1005.0,320.0,meat
5,tiger,1006.0,330.0,meat
6,tiger,1007.0,290.0,meat
7,tiger,1008.0,310.0,meat
8,zebra,1009.0,200.0,meat
9,zebra,1010.0,220.0,meat


In [138]:
df_zoo.merge(df_zoo_eats, how="inner") #default one

Unnamed: 0,animal,uniq_id,water_need,food
0,elephant,1001,500,vegetables
1,elephant,1002,600,vegetables
2,elephant,1003,550,vegetables
3,tiger,1004,300,meat
4,tiger,1005,320,meat
5,tiger,1006,330,meat
6,tiger,1007,290,meat
7,tiger,1008,310,meat
8,zebra,1009,200,meat
9,zebra,1010,220,meat


## Pandas Merge. On which column?

zoo.merge(zoo_eats, how = 'left', left_on = 'animal', right_on = 'animal')

In [152]:
df_zoo.merge(df_zoo_eats, how="inner", on="animal")

Unnamed: 0,animal,uniq_id,water_need,food
0,elephant,1001,500,vegetables
1,elephant,1002,600,vegetables
2,elephant,1003,550,vegetables
3,tiger,1004,300,meat
4,tiger,1005,320,meat
5,tiger,1006,330,meat
6,tiger,1007,290,meat
7,tiger,1008,310,meat
8,zebra,1009,200,meat
9,zebra,1010,220,meat


## Sorting in pandas

In [153]:
df_zoo.sort_values("water_need")

Unnamed: 0,animal,uniq_id,water_need
14,zebra,1015,80
13,zebra,1014,100
8,zebra,1009,200
9,zebra,1010,220
12,zebra,1013,220
11,zebra,1012,230
10,zebra,1011,240
6,tiger,1007,290
3,tiger,1004,300
7,tiger,1008,310


In [158]:
df_zoo.sort_values(by = ["water_need", 'animal'], ascending=False)

Unnamed: 0,animal,uniq_id,water_need
16,lion,1017,600
1,elephant,1002,600
2,elephant,1003,550
17,lion,1018,500
0,elephant,1001,500
20,kangaroo,1021,430
15,lion,1016,420
19,kangaroo,1020,410
21,kangaroo,1022,410
18,lion,1019,390


## Reset_index

if you look at the result of above example and check the indexes number.
What a mess with all the indexes after that last sorting, right?

It’s not just that it’s ugly… wrong indexing can mess up your visualizations (more about that in my matplotlib tutorials) or even your machine learning models.

The point is: in certain cases, when you have done a transformation on your dataframe, you have to re-index the rows. For that, you can use the reset_index() method.

In [159]:
df_zoo.sort_values(by=["water_need", "animal"], ascending=False).reset_index()

Unnamed: 0,index,animal,uniq_id,water_need
0,16,lion,1017,600
1,1,elephant,1002,600
2,2,elephant,1003,550
3,17,lion,1018,500
4,0,elephant,1001,500
5,20,kangaroo,1021,430
6,15,lion,1016,420
7,19,kangaroo,1020,410
8,21,kangaroo,1022,410
9,18,lion,1019,390


### if we to remove old indexes, just drop it

In [160]:
df_zoo.sort_values(by=["water_need", "animal"], ascending=False).reset_index(drop=True)

Unnamed: 0,animal,uniq_id,water_need
0,lion,1017,600
1,elephant,1002,600
2,elephant,1003,550
3,lion,1018,500
4,elephant,1001,500
5,kangaroo,1021,430
6,lion,1016,420
7,kangaroo,1020,410
8,kangaroo,1022,410
9,lion,1019,390


## Fillna

In [161]:
df_zoo.merge(df_zoo_eats, how="left")

Unnamed: 0,animal,uniq_id,water_need,food
0,elephant,1001,500,vegetables
1,elephant,1002,600,vegetables
2,elephant,1003,550,vegetables
3,tiger,1004,300,meat
4,tiger,1005,320,meat
5,tiger,1006,330,meat
6,tiger,1007,290,meat
7,tiger,1008,310,meat
8,zebra,1009,200,meat
9,zebra,1010,220,meat


we have NaN values for lions. NaN itself can be really distracting, so I usually like to replace it with something more meaningful. In some cases, this can be a 0 value, or in other cases a specific string value, but this time, I’ll go with unknown.

In [162]:
df_zoo.merge(df_zoo_eats, how="left").fillna("unknown")

Unnamed: 0,animal,uniq_id,water_need,food
0,elephant,1001,500,vegetables
1,elephant,1002,600,vegetables
2,elephant,1003,550,vegetables
3,tiger,1004,300,meat
4,tiger,1005,320,meat
5,tiger,1006,330,meat
6,tiger,1007,290,meat
7,tiger,1008,310,meat
8,zebra,1009,200,meat
9,zebra,1010,220,meat


But we know lion eat "meat", so we will fill with meat

In [163]:
df_zoo.merge(df_zoo_eats, how="left").fillna("meat")

Unnamed: 0,animal,uniq_id,water_need,food
0,elephant,1001,500,vegetables
1,elephant,1002,600,vegetables
2,elephant,1003,550,vegetables
3,tiger,1004,300,meat
4,tiger,1005,320,meat
5,tiger,1006,330,meat
6,tiger,1007,290,meat
7,tiger,1008,310,meat
8,zebra,1009,200,meat
9,zebra,1010,220,meat


## Test yourself

In [210]:
df_articles.head(2)

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America


Another data set

In [211]:
df_blog_buy = pd.read_csv("pandas_tutorial_buy.csv", delimiter=";", names = ['my_date_time', 'event', 'user_id', 'amount'])

In [212]:
df_blog_buy.head(2)

Unnamed: 0,my_date_time,event,user_id,amount
0,2018-01-01 04:04:59,buy,2458151555,8
1,2018-01-01 09:28:00,buy,2458151933,8


### TASK #1: What’s the average (mean) revenue between 2018-01-01 and 2018-01-07 from the users in the article_read dataframe?

In [214]:
df_whole = df_articles.merge(df_blog_buy, how="left", left_on="user_id", right_on ="user_id")
df_whole.head(2)

Unnamed: 0,my_datetime,event_x,country,user_id,source,topic,my_date_time,event_y,amount
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America,,,
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America,,,


In [215]:
df_whole["my_datetime"].min()
df_whole["my_datetime"].max()

'2018-01-01 23:59:38'

In [224]:
df_amount = df_whole.amount

In [225]:
amount = amount.fillna(0)

In [226]:
print("Average Revenue: ", amount.mean())

Average Revenue:  1.0852367688022284


### TASK #2: Print the top 3 countries by total revenue between 2018-01-01 and 2018-01-07! 

In [227]:
df_articles.head(2)

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America


In [228]:
df_blog_buy.head(2)

Unnamed: 0,my_date_time,event,user_id,amount
0,2018-01-01 04:04:59,buy,2458151555,8
1,2018-01-01 09:28:00,buy,2458151933,8


In [244]:
df_whole = df_articles.merge(df_blog_buy, how="left", left_on="user_id", right_on="user_id").fillna(0)
df_whole.head(2)

Unnamed: 0,my_datetime,event_x,country,user_id,source,topic,my_date_time,event_y,amount
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America,0,0,0.0
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America,0,0,0.0


In [245]:
total_by_country = df_whole.groupby("country")[["amount"]].sum()
total_by_country

Unnamed: 0_level_0,amount
country,Unnamed: 1_level_1
country_1,0.0
country_2,296.0
country_3,0.0
country_4,1112.0
country_5,324.0
country_6,0.0
country_7,200.0
country_8,16.0


In [246]:
total_by_country.sort_values("amount", ascending=False).head(3)

Unnamed: 0_level_0,amount
country,Unnamed: 1_level_1
country_4,1112.0
country_5,324.0
country_2,296.0
