# Pandas (2)

Let's learn more advanced features of Pandas.

**Outline**
- Function application
- Grouping and Aggregating
- Joining DataFrames
- Working with Text Data


In [None]:
import numpy as np
import pandas as pd

## Function application

To apply your own or another library’s functions to pandas objects, you should be aware of the three methods below.

### 1. Row or column-wise function application
Arbitrary functions can be applied along the axes of a DataFrame using the `apply()` method, which, like the descriptive statistics methods, takes an optional `axis` argument:

In [None]:
index = pd.date_range("1/1/2000", periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

In [None]:
df

Unnamed: 0,A,B,C
2000-01-01,1.158119,0.735191,-0.494428
2000-01-02,-0.596823,0.432077,-1.020516
2000-01-03,0.176651,-0.298855,0.186945
2000-01-04,-0.049657,-0.815672,0.782848
2000-01-05,1.289411,-1.47765,-0.064191
2000-01-06,0.95909,-1.052802,-0.906679
2000-01-07,-0.304072,0.51401,1.34321
2000-01-08,0.247464,0.159653,1.519955


In [None]:
df.apply(np.mean) #default axis = 0 or ‘index’: apply function to each column.

A    0.300835
B   -0.506485
C   -0.331174
dtype: float64

In [None]:
df.apply(np.mean, axis=1) # axis = 1 or ‘columns’: apply function to each row.

2000-01-01   -0.009411
2000-01-02   -0.574021
2000-01-03    0.049999
2000-01-04    0.692938
2000-01-05   -0.164656
2000-01-06   -0.842239
2000-01-07   -0.532769
2000-01-08   -0.051373
Freq: D, dtype: float64

In [None]:
# df.apply(lambda x: x.max() - x.min(), axis=1)
def calRange(x):
  return x.max() - x.min()

df.apply(calRange, axis=1)

Unnamed: 0,0
2000-01-01,1.652547
2000-01-02,1.452593
2000-01-03,0.4858
2000-01-04,1.59852
2000-01-05,2.767061
2000-01-06,2.011892
2000-01-07,1.647282
2000-01-08,1.360302


You may also pass additional arguments and keyword arguments to the apply() method. For instance, consider the following function you would like to apply:

In [None]:
def subtract_and_divide(x, sub, divide=1):
  return (x - sub) / divide

df.apply(subtract_and_divide, sub=5, divide=2)

Unnamed: 0,A,B,C
2000-01-01,-1.92094,-2.132405,-2.747214
2000-01-02,-2.798411,-2.283962,-3.010258
2000-01-03,-2.411674,-2.649427,-2.406527
2000-01-04,-2.524828,-2.907836,-2.108576
2000-01-05,-1.855295,-3.238825,-2.532096
2000-01-06,-2.020455,-3.026401,-2.953339
2000-01-07,-2.652036,-2.242995,-1.828395
2000-01-08,-2.376268,-2.420174,-1.740023


In [None]:
# Using apply with Series - apply function to each element in the Series
df['A'].apply(lambda x: x + 5)

Unnamed: 0,A
2000-01-01,6.158119
2000-01-02,4.403177
2000-01-03,5.176651
2000-01-04,4.950343
2000-01-05,6.289411
2000-01-06,5.95909
2000-01-07,4.695928
2000-01-08,5.247464
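For simple arithmetic like the cell above, an element-wise `apply()` and a vectorized expression should produce the same values. The following is a minimal sketch of that comparison; the Series `s` is made-up illustration data, not the notebook's random DataFrame.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
s = pd.Series([1.0, 2.5, -0.5], index=["x", "y", "z"])

applied = s.apply(lambda v: v + 5)  # calls the lambda once per element
vectorized = s + 5                  # a single vectorized operation on the whole Series

print(applied.equals(vectorized))
```

The vectorized form is usually the better default for arithmetic; `apply()` is mainly useful when the logic cannot be expressed with built-in operations.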


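### 2. Applying elementwise functions
Since not all functions can be vectorized (that is, accept NumPy arrays and return another array or value), the methods `applymap()` on **DataFrame** and analogously `map()` on **Series** accept any Python function *taking a single value and returning a single value.* Since pandas 2.1, `DataFrame.applymap()` is deprecated in favor of `DataFrame.map()`, so newer versions emit a `FutureWarning` when `applymap()` is called. For example: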

In [None]:
df2 = pd.DataFrame(
    [["Apple", 1, 100],
     ["Banana", 2, 50],
     ["Orange", 5, np.nan],
     ["Mango", 3, 250]],
    index=["a", "b", "c", "d"],
    columns=["Fruit", "Amount", "Price"]
)
df2

Unnamed: 0,Fruit,Amount,Price
a,Apple,1,100.0
b,Banana,2,50.0
c,Orange,5,
d,Mango,3,250.0


In [None]:
df2["Fruit"].map(len)

Unnamed: 0,Fruit
a,5
b,6
c,6
d,5


In [None]:
def get_len(e):
    return len(str(e))

df2.applymap(get_len)


Unnamed: 0,Fruit,Amount,Price
a,5,1,5
b,6,1,4
c,6,1,3
d,5,1,5
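Because `DataFrame.map()` only exists from pandas 2.1 onward, one way to keep an element-wise call working across versions is to pick whichever method is available. This is a sketch under that assumption; `frame` and the `elementwise` alias are hypothetical names used only for illustration.

```python
import pandas as pd

def get_len(e):
    # Length of the value's string representation.
    return len(str(e))

# Hypothetical frame for illustration.
frame = pd.DataFrame({"Fruit": ["Apple", "Banana"], "Amount": [1, 20]})

# Use DataFrame.map when it exists (pandas >= 2.1), otherwise fall back to applymap.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
lengths = elementwise(get_len)

print(lengths)
```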


**EX: Student Alcohol Consumption**

In [None]:
csv_url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv'
df = pd.read_csv(csv_url)
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


**For the purpose of this exercise, slice the DataFrame from the `school` column through the `guardian` column.**

In [None]:
df = df.loc[:, 'school':'guardian']
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother
4,GP,F,16,U,GT3,T,3,3,other,other,home,father


**Create a lambda function that will capitalize strings.**

**[Hint]** Use the `.capitalize()` method.

**Capitalize both `Mjob` and `Fjob`**

In [None]:
# YOUR CODE HERE
def capFunc(job):
  return job.capitalize()
# print(df['Mjob'].apply(capFunc))

df['Mjob'] = df['Mjob'].apply(capFunc)
df['Fjob'] = df['Fjob'].apply(capFunc)

df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian
0,GP,F,18,U,GT3,A,4,4,At_home,Teacher,course,mother
1,GP,F,17,U,GT3,T,1,1,At_home,Other,course,father
2,GP,F,15,U,LE3,T,1,1,At_home,Other,other,mother
3,GP,F,15,U,GT3,T,4,2,Health,Services,home,mother
4,GP,F,16,U,GT3,T,3,3,Other,Other,home,father
...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,Services,Services,course,other
391,MS,M,17,U,LE3,T,3,1,Services,Services,course,mother
392,MS,M,21,R,GT3,T,1,1,Other,Other,course,other
393,MS,M,18,R,LE3,T,3,2,Services,Other,course,mother


**Create a function called `majority` that returns a boolean value to a new column called** `legal_drinker`

**(Consider majority as older than 17 years old)**

In [None]:
# YOUR CODE HERE
def majority(x):
  return x['age'] > 17

df['legal_drinker'] = df.apply(majority, axis=1)

In [None]:
df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,legal_drinker
0,GP,F,18,U,GT3,A,4,4,At_home,Teacher,course,mother,True
1,GP,F,17,U,GT3,T,1,1,At_home,Other,course,father,False
2,GP,F,15,U,LE3,T,1,1,At_home,Other,other,mother,False
3,GP,F,15,U,GT3,T,4,2,Health,Services,home,mother,False
4,GP,F,16,U,GT3,T,3,3,Other,Other,home,father,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,20,U,LE3,A,2,2,Services,Services,course,other,True
391,MS,M,17,U,LE3,T,3,1,Services,Services,course,mother,False
392,MS,M,21,R,GT3,T,1,1,Other,Other,course,other,True
393,MS,M,18,R,LE3,T,3,2,Services,Other,course,mother,True
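The row-wise `apply(..., axis=1)` above calls `majority` once per row. For a single-column comparison like this, a vectorized comparison should give the same booleans; the sketch below checks that on a small hypothetical `students` frame rather than the downloaded dataset.

```python
import pandas as pd

# Hypothetical ages standing in for the `age` column of the dataset.
students = pd.DataFrame({"age": [18, 17, 15, 21]})

row_wise = students.apply(lambda row: row["age"] > 17, axis=1)  # one call per row
vectorized = students["age"] > 17                               # one comparison on the column

print(row_wise.tolist() == vectorized.tolist())
```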


### 3. Tablewise function application
DataFrames and Series can be passed into functions.

In [None]:
df_p = pd.DataFrame({"state_and_code": ["Illinois, IL", "California, CA", "Florida, FL"]})
df_p

Unnamed: 0,state_and_code
0,"Illinois, IL"
1,"California, CA"
2,"Florida, FL"


In [None]:
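def extract_state_name(df):
    """
    Split "Illinois, IL" into state_name "Illinois" and state_code "IL".
    """
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified
    parts = df["state_and_code"].str.split(", ")  # split on ", " so the code has no leading space
    df["state_name"] = parts.str.get(0)
    df["state_code"] = parts.str.get(1)
    return df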


In [None]:
df_p_state_name = extract_state_name(df_p)
df_p_state_name

Unnamed: 0,state_and_code,state_name,state_code
0,"Illinois, IL",Illinois,IL
1,"California, CA",California,CA
2,"Florida, FL",Florida,FL
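Functions that take a whole DataFrame and return one can also be chained with the `pipe()` method, which passes the DataFrame as the first argument. The sketch below illustrates this; the helper names `add_state_columns` and `keep_codes` are hypothetical and exist only for this example.

```python
import pandas as pd

def add_state_columns(frame):
    # Work on a copy so the input DataFrame is left unchanged.
    out = frame.copy()
    parts = out["state_and_code"].str.split(", ")
    out["state_name"] = parts.str.get(0)
    out["state_code"] = parts.str.get(1)
    return out

def keep_codes(frame, codes):
    # Keep only the rows whose state_code is in `codes`.
    return frame[frame["state_code"].isin(codes)]

raw = pd.DataFrame({"state_and_code": ["Illinois, IL", "California, CA", "Florida, FL"]})

# Equivalent to keep_codes(add_state_columns(raw), codes=["IL", "FL"]), read left to right.
result = raw.pipe(add_state_columns).pipe(keep_codes, codes=["IL", "FL"])

print(result)
```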


## Grouping and aggregation

In [None]:
df = pd.read_csv('vgsales.csv')

This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com.

**Fields include:**

- Rank - Ranking of overall sales
- Name - The game's name
- Platform - Platform of the game's release (e.g., PC, PS4)
- Year - Year of the game's release
- Genre - Genre of the game
- Publisher - Publisher of the game
- NA_Sales - Sales in North America (in millions)
- EU_Sales - Sales in Europe (in millions)
- JP_Sales - Sales in Japan (in millions)
- Other_Sales - Sales in the rest of the world (in millions)
- Global_Sales - Total worldwide sales.

In [None]:
df.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


### value_counts

In [None]:
df['Platform'].value_counts()

Unnamed: 0_level_0,count
Platform,Unnamed: 1_level_1
DS,2163
PS2,2161
PS3,1329
Wii,1325
X360,1265
PSP,1213
PS,1196
PC,960
XB,824
GBA,822


In [None]:
df.value_counts('Platform')

Unnamed: 0_level_0,count
Platform,Unnamed: 1_level_1
DS,2163
PS2,2161
PS3,1329
Wii,1325
X360,1265
PSP,1213
PS,1196
PC,960
XB,824
GBA,822


In [None]:
df['Genre'].value_counts(normalize=True)
# normalize - if True then the object returned will contain
# the relative frequencies of the unique values.

Unnamed: 0_level_0,proportion
Genre,Unnamed: 1_level_1
Action,0.199783
Sports,0.141342
Misc,0.104772
Role-Playing,0.089649
Shooter,0.078925
Adventure,0.077479
Racing,0.07525
Platform,0.05338
Simulation,0.052235
Fighting,0.05109
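With `normalize=True`, each count is divided by the total number of non-missing values, so the proportions sum to 1. A small sketch with a made-up `genres` Series (not the `vgsales.csv` data):

```python
import pandas as pd

# Hypothetical genre labels for illustration.
genres = pd.Series(["Action", "Sports", "Action", "Misc"])

counts = genres.value_counts()                # absolute frequencies
shares = genres.value_counts(normalize=True)  # relative frequencies

print(shares["Action"])
```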


### Group by: split-apply-combine

By “group by” we are referring to a process involving one or more of the following steps:

- **Splitting** the data into groups based on some criteria.
- **Applying** a function to each group independently.
- **Combining** the results into a data structure.
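The three steps can be written out by hand to see what `groupby` does in one call. This is only an illustrative sketch on a tiny hypothetical `sales` frame; in practice the built-in aggregation on the last line is the idiomatic form.

```python
import pandas as pd

# Hypothetical data for illustration.
sales = pd.DataFrame({
    "Genre": ["Action", "Sports", "Action", "Sports"],
    "Global_Sales": [1.0, 2.0, 3.0, 4.0],
})

# Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs.
# Apply: sum the sales of each sub-DataFrame.
# Combine: collect the per-group results into a single Series.
manual = pd.Series(
    {genre: group["Global_Sales"].sum() for genre, group in sales.groupby("Genre")}
)

# The same computation in one step.
builtin = sales.groupby("Genre")["Global_Sales"].sum()

print(manual.to_dict())
```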

Let's count the games in each genre for every publisher.

In [None]:
df.groupby('Publisher')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x79a6b1321d20>

In [None]:
df.groupby('Publisher')['Genre'].value_counts()[:20]

Unnamed: 0_level_0,Unnamed: 1_level_0,count
Publisher,Genre,Unnamed: 2_level_1
10TACLE Studios,Adventure,1
10TACLE Studios,Strategy,1
10TACLE Studios,Puzzle,1
1C Company,Racing,1
1C Company,Role-Playing,1
1C Company,Strategy,1
20th Century Fox Video Games,Action,4
20th Century Fox Video Games,Shooter,1
2D Boy,Puzzle,1
3DO,Action,17


In [None]:
s = df.groupby(['Genre', 'Platform'])['Global_Sales'].sum()
s[:20]

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales
Genre,Platform,Unnamed: 2_level_1
Action,2600,29.34
Action,3DS,57.02
Action,DC,1.26
Action,DS,115.56
Action,GB,7.92
Action,GBA,55.76
Action,GC,37.84
Action,GEN,2.74
Action,N64,29.58
Action,NES,28.75


In [None]:
s[:20].unstack()

Platform,2600,3DS,DC,DS,GB,GBA,GC,GEN,N64,NES,PC,PS,PS2,PS3,PS4,PSP,PSV,SAT,SNES,Wii
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
Action,29.34,57.02,1.26,115.56,7.92,55.76,37.84,2.74,29.58,28.75,31.53,127.05,272.76,307.88,87.06,64.72,20.01,0.65,10.08,118.58
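`unstack()` pivots the innermost index level into columns, filling combinations that do not occur with `NaN`. A minimal sketch on a hypothetical two-level Series `totals`:

```python
import pandas as pd

# Hypothetical totals indexed by (Genre, Platform).
totals = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.MultiIndex.from_tuples(
        [("Action", "DS"), ("Action", "Wii"), ("Sports", "Wii")],
        names=["Genre", "Platform"],
    ),
)

wide = totals.unstack()  # Platform (the last level) becomes the columns

print(wide)
```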


In [None]:
df.groupby(['Genre', 'Platform'], as_index=False)['Global_Sales'].sum()

Unnamed: 0,Genre,Platform,Global_Sales
0,Action,2600,29.34
1,Action,3DS,57.02
2,Action,DC,1.26
3,Action,DS,115.56
4,Action,GB,7.92
...,...,...,...
288,Strategy,Wii,5.23
289,Strategy,WiiU,1.24
290,Strategy,X360,10.13
291,Strategy,XB,2.78
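Passing `as_index=False` keeps the group keys as ordinary columns instead of an index. The result should match grouping normally and then calling `reset_index()`; the sketch below compares the two on a small hypothetical `sales` frame.

```python
import pandas as pd

# Hypothetical data for illustration.
sales = pd.DataFrame({
    "Genre": ["Action", "Action", "Sports"],
    "Platform": ["DS", "DS", "Wii"],
    "Global_Sales": [1.0, 2.0, 4.0],
})

flat = sales.groupby(["Genre", "Platform"], as_index=False)["Global_Sales"].sum()
via_reset = sales.groupby(["Genre", "Platform"])["Global_Sales"].sum().reset_index()

print(flat.to_dict("list") == via_reset.to_dict("list"))
```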


**EX: Compute the mean `Global_Sales` of the games in each `Genre` and sort the result in descending order.**

*Hint: sort with `sort_values()`*

In [None]:
# group by Genre
genres = df.groupby('Genre')['Global_Sales'].mean()
genres.sort_values(ascending=False)

Unnamed: 0_level_0,Global_Sales
Genre,Unnamed: 1_level_1
Platform,0.938341
Shooter,0.791885
Role-Playing,0.623233
Racing,0.586101
Sports,0.567319
Fighting,0.529375
Action,0.5281
Misc,0.465762
Simulation,0.452364
Puzzle,0.420876


**EX: Show the number of games in each `Genre` for games whose `Platform` is `Wii` or `X360`.**

*Hint: use `.loc[]`*

In [None]:
counts = df.groupby('Platform')['Genre'].value_counts()

counts.loc[['Wii', 'X360']]

Unnamed: 0_level_0,Unnamed: 1_level_0,count
Platform,Genre,Unnamed: 2_level_1
Wii,Misc,280
Wii,Sports,261
Wii,Action,238
Wii,Racing,94
Wii,Simulation,87
Wii,Adventure,84
Wii,Shooter,66
Wii,Platform,58
Wii,Puzzle,55
Wii,Fighting,42


### Aggregation

Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data.

We can use the `agg()` method.

In [None]:
df.groupby('Genre')['Global_Sales'].agg(['mean', 'median', 'sum'])

Unnamed: 0_level_0,mean,median,sum
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Action,0.5281,0.19,1751.18
Adventure,0.185879,0.06,239.04
Fighting,0.529375,0.21,448.91
Misc,0.465762,0.16,809.96
Platform,0.938341,0.28,831.37
Puzzle,0.420876,0.11,244.95
Racing,0.586101,0.19,732.04
Role-Playing,0.623233,0.185,927.37
Shooter,0.791885,0.23,1037.37
Simulation,0.452364,0.16,392.2
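`agg()` also accepts keyword arguments of the form `new_name=(column, function)`, often called named aggregation, which lets you choose the output column names. A sketch on a hypothetical `sales` frame; the output names `mean_sales`, `total_sales`, and `n_games` are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical data for illustration.
sales = pd.DataFrame({
    "Genre": ["Action", "Action", "Sports"],
    "Global_Sales": [1.0, 3.0, 2.0],
})

summary = sales.groupby("Genre").agg(
    mean_sales=("Global_Sales", "mean"),
    total_sales=("Global_Sales", "sum"),
    n_games=("Global_Sales", "count"),
)

print(summary)
```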


### Applying a function to each group independently

Let's find out how many games in each `Genre` have higher `JP_Sales` than `NA_Sales`.

In [None]:
def my_func(game):
  return len(game[game['JP_Sales'] > game['NA_Sales']])

df.groupby('Genre').apply(my_func)

Unnamed: 0_level_0,0
Genre,Unnamed: 1_level_1
Action,684
Adventure,675
Fighting,301
Misc,450
Platform,94
Puzzle,104
Racing,78
Role-Playing,772
Shooter,107
Simulation,201
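A count like this can also be computed without `apply()`: build a boolean Series first, then group it by genre and sum, since `True` counts as 1. The following sketch uses a small hypothetical `games` frame, not the real dataset.

```python
import pandas as pd

# Hypothetical data for illustration.
games = pd.DataFrame({
    "Genre": ["Action", "Action", "Sports", "Sports"],
    "JP_Sales": [0.5, 0.1, 0.3, 0.0],
    "NA_Sales": [0.2, 0.4, 0.1, 0.9],
})

jp_wins = games["JP_Sales"] > games["NA_Sales"]  # one boolean flag per game
counts = jp_wins.groupby(games["Genre"]).sum()   # number of True flags per genre

print(counts.to_dict())
```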


**EX: For the games whose `JP_Sales` exceeds `NA_Sales`, find the mean of the sales difference `JP_Sales` - `NA_Sales` in each `Genre`.**

In [None]:
# Your code here
def my_func(game):
  # Keep the difference in a local Series so the group itself is not modified
  diff = game['JP_Sales'] - game['NA_Sales']
  return diff[diff > 0].mean()

df.groupby('Genre').apply(my_func)

Unnamed: 0_level_0,0
Genre,Unnamed: 1_level_1
Action,0.112646
Adventure,0.057807
Fighting,0.158073
Misc,0.159222
Platform,0.272553
Puzzle,0.271635
Racing,0.178718
Role-Playing,0.263355
Shooter,0.136075
Simulation,0.200995


## Joining DataFrames

With pandas, you can `merge`, `join`, and `concatenate` your datasets, allowing you to unify and better understand your data as you analyze it.

- `merge()` for combining data on common columns or indices
- `join()` for combining data on a key column or an index
- `concat()` for combining DataFrames across rows or columns

[Credit](https://realpython.com/pandas-merge-join-and-concat/)

### How to Use merge()

When you use `merge()`, you’ll provide two required arguments:

- The `left` DataFrame
- The `right` DataFrame

After that, you can provide a number of optional arguments to define how your datasets are merged:

- `how` defines what kind of merge to make. It defaults to 'inner', but other possible options include 'outer', 'left', and 'right'.

- `on` tells `merge()` which columns or indices, also called key columns or key indices, you want to join on. This is optional. If it isn’t specified, and `left_index` and `right_index` (covered below) are False, then columns from the two DataFrames that share names will be used as join keys. If you use `on`, then the column or index that you specify must be present in both objects.

- `left_on` and `right_on` specify a column or index that’s present only in the left or right object that you’re merging. Both default to None.

- `left_index` and `right_index` both default to False, but if you want to use the index of the left or right object to be merged, then you can set the relevant argument to True.

- `suffixes` is a tuple of strings to append to identical column names that aren’t merge keys. This allows you to keep track of the origins of columns with the same name.

![joins](https://files.realpython.com/media/join_diagram.93e6ef63afbe.png)

**Merge methods:**

- **left** => `LEFT OUTER JOIN` - Use keys from left frame only
- **right** => `RIGHT OUTER JOIN` - Use keys from right frame only
- **outer** => `FULL OUTER JOIN` - Use union of keys from both frames
- **inner** => `INNER JOIN` - Use intersection of keys from both frames
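To make the four `how` options concrete, the sketch below merges two small hypothetical frames (`left_demo` and `right_demo`, which share only some keys) and records which keys survive each kind of merge.

```python
import pandas as pd

# Hypothetical frames: K0 exists only on the left, K3 only on the right.
left_demo = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right_demo = pd.DataFrame({"key": ["K1", "K2", "K3"], "C": ["C1", "C2", "C3"]})

keys_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left_demo, right_demo, how=how, on="key")
    keys_by_how[how] = sorted(merged["key"])

for how, keys in keys_by_how.items():
    print(how, keys)
```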

In [None]:
left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)


right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

In [None]:
result = pd.merge(left, right, on="key")
result

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3


We can also join on multiple keys.

In this example, we join `left` and `right` with keys: `key1` and `key2`

In [None]:
left = pd.DataFrame(
   {
      "key1": ["K0", "K0", "K1", "K2"],
      "key2": ["K0", "K1", "K0", "K1"],
      "A": ["A0", "A1", "A2", "A3"],
      "B": ["B0", "B1", "B2", "B3"],
   }
)


right = pd.DataFrame(
   {
      "key1": ["K0", "K1", "K1", "K2"],
      "key2": ["K0", "K0", "K0", "K0"],
      "C": ["C0", "C1", "C2", "C3"],
      "D": ["D0", "D1", "D2", "D3"],
   }
)

In [None]:
result = pd.merge(left, right, how="left", on=["key1", "key2"])

result

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,


#### Inner Join

In [None]:
result = pd.merge(left, right, how="inner", on=["key1", "key2"])

result

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


#### Outer Join

In [None]:
result = pd.merge(left, right, how="outer", on=["key1", "key2"])

result

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,
5,K2,K0,,,C3,D3


**Exercise**

Perform a `left` join between `products` and `categories`.

In [5]:
products = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "NAME": ["iPhone 15", "Galaxy Fold 6", "TV", "Pillow", "Generic Blanket"],
    "PRICE": [50000, 60000, 15000, 1200, 500],
    "CAT_ID": [1, 1, 2, 3, 3]
})
products

Unnamed: 0,ID,NAME,PRICE,CAT_ID
0,1,iPhone 15,50000,1
1,2,Galaxy Fold 6,60000,1
2,3,TV,15000,2
3,4,Pillow,1200,3
4,5,Generic Blanket,500,3


In [6]:
categories = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "NAME": ["Mobile", "Electronics", "Household", "Kitchen", "Toys"]
})
categories

Unnamed: 0,ID,NAME
0,1,Mobile
1,2,Electronics
2,3,Household
3,4,Kitchen
4,5,Toys


In [7]:
# YOUR CODE HERE
result = pd.merge(products, categories, how="left", left_on="CAT_ID", right_on="ID")
result

Unnamed: 0,ID_x,NAME_x,PRICE,CAT_ID,ID_y,NAME_y
0,1,iPhone 15,50000,1,1,Mobile
1,2,Galaxy Fold 6,60000,1,1,Mobile
2,3,TV,15000,2,2,Electronics
3,4,Pillow,1200,3,3,Household
4,5,Generic Blanket,500,3,3,Household
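The `_x` and `_y` endings in the result above are the default `suffixes`. As a sketch of how they could be made more readable, the example below repeats the same kind of merge on smaller hypothetical frames (`products_demo`, `categories_demo`) with custom suffixes; the suffix strings are arbitrary choices.

```python
import pandas as pd

# Hypothetical, shortened versions of the exercise tables.
products_demo = pd.DataFrame({"ID": [1, 2], "NAME": ["TV", "Pillow"], "CAT_ID": [2, 3]})
categories_demo = pd.DataFrame({"ID": [2, 3], "NAME": ["Electronics", "Household"]})

merged = pd.merge(
    products_demo,
    categories_demo,
    how="left",
    left_on="CAT_ID",
    right_on="ID",
    suffixes=("_product", "_category"),  # applied only to overlapping column names
)

print(list(merged.columns))
```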


Now, let's do a `right` join between `products` and `categories`.

In [8]:
# YOUR CODE HERE
result = pd.merge(products, categories, how="right", left_on="CAT_ID", right_on="ID")
result

Unnamed: 0,ID_x,NAME_x,PRICE,CAT_ID,ID_y,NAME_y
0,1.0,iPhone 15,50000.0,1.0,1,Mobile
1,2.0,Galaxy Fold 6,60000.0,1.0,1,Mobile
2,3.0,TV,15000.0,2.0,2,Electronics
3,4.0,Pillow,1200.0,3.0,3,Household
4,5.0,Generic Blanket,500.0,3.0,3,Household
5,,,,,4,Kitchen
6,,,,,5,Toys


From the joined DataFrame, can you find the number of products in each product category?

In [9]:
# YOUR CODE HERE
result.groupby('NAME_y')['ID_x'].count()

Unnamed: 0_level_0,ID_x
NAME_y,Unnamed: 1_level_1
Electronics,1
Household,2
Kitchen,0
Mobile,2
Toys,0


### .join()
`.join()` uses `merge()` under the hood, but provides a simpler interface and by default joins on indexes. Here is an introductory example of an inner join on the index. When the two DataFrames share column names, the `lsuffix` and `rsuffix` parameters are used to tell the overlapping columns apart.

In [None]:
left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
)


right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
)

In [None]:
left

Unnamed: 0,A,B
K0,A0,B0
K1,A1,B1
K2,A2,B2


In [None]:
right

Unnamed: 0,C,D
K0,C0,D0
K2,C2,D2
K3,C3,D3


In [None]:
left.join(right, how="inner")

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K2,A2,B2,C2,D2
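The example above has no overlapping column names, so no suffixes are needed. The sketch below shows the case where both frames have a column with the same name; the frames `a` and `b` and the suffix strings are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical frames that both contain a column named "value".
a = pd.DataFrame({"value": [1, 2]}, index=["K0", "K1"])
b = pd.DataFrame({"value": [10, 30]}, index=["K0", "K2"])

# join() defaults to how="left", so only the index labels of `a` are kept.
joined = a.join(b, lsuffix="_left", rsuffix="_right")

print(joined)
```

Labels missing from `b` (here `K1`) get `NaN` in the right-hand column, which is why that column becomes floating point.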


### concat()

First, you will see a basic concatenation along axis 0.

In [None]:
reindexed = pd.concat(
    [left, right], ignore_index=True
)
reindexed

Unnamed: 0,A,B,C,D
0,A0,B0,,
1,A1,B1,,
2,A2,B2,,
3,,,C0,D0
4,,,C2,D2
5,,,C3,D3


In [None]:
results = pd.concat(
    [left, right], axis=1
)
results

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
K3,,,C3,D3


In [None]:
results = pd.concat(
    [left, right], axis=1, join="inner"
)
results

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K2,A2,B2,C2,D2
