## Data Wrangling & Cleansing Exercise

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

### 1. Import & Load Data

Import the `mpg` dataset from `seaborn`, load it, and display the first 5 rows.

In [2]:
mpg = sns.load_dataset('mpg')
df = pd.DataFrame(mpg)
df.head(5)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


### 2. How many rows and columns does the "mpg" DataFrame have? What are the data types of its columns?

In [3]:
# (rows, columns)
df.shape

(398, 9)

In [4]:
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight            int64
acceleration    float64
model_year        int64
origin           object
name             object
dtype: object

### 3. How many cars have an "mpg" greater than 25?

In [5]:
len(df[df['mpg'] > 25])

158
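
The same count can be obtained without building the filtered frame, because summing a boolean mask counts its `True` entries. A minimal sketch on a small hand-made frame (the `toy` values are illustrative and not taken from the `mpg` dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({'mpg': [18.0, 26.5, 31.0, 25.0, 14.0]})

# A comparison yields a boolean Series; True counts as 1 when summed
count_over_25 = int((toy['mpg'] > 25).sum())

# Same result as the len(...) form used in the cell above
assert count_over_25 == len(toy[toy['mpg'] > 25])
print(count_over_25)
```

Note that the comparison is strict, so a car with exactly 25 mpg is not counted.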

### 4. What are the mean "mpg" and "horsepower" of the cars in the dataset?

In [6]:
mpg_mean = df['mpg'].mean()
horsepower_mean = df['horsepower'].mean()

df_means = pd.DataFrame({
    'mpg': [mpg_mean],
    'horsepower': [horsepower_mean]
}, index=['rata-rata'])

df_means

Unnamed: 0,mpg,horsepower
rata-rata,23.514573,104.469388
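
Both means can also be computed in one call by selecting the two columns first. The sketch below uses a hypothetical `toy` frame with one missing `horsepower` value to show that `mean()` skips `NaN` by default, which matters here because `horsepower` contains missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing horsepower value
toy = pd.DataFrame({
    'mpg': [10.0, 20.0, 30.0],
    'horsepower': [100.0, np.nan, 200.0],
})

# mean() on a two-column selection returns one value per column;
# NaN entries are skipped by default (skipna=True)
means = toy[['mpg', 'horsepower']].mean()

# Turn the Series into a one-row frame labelled 'mean'
means_frame = means.to_frame(name='mean').T
print(means_frame)
```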


### 5. How many cars are there in each country-of-origin ("origin") group?

In [7]:
df['origin'].value_counts()

origin
usa       249
japan      79
europe     70
Name: count, dtype: int64

### 6. What are the names of the cars with "mpg" between 20 and 25?

In [8]:
df[(df['mpg'] >= 20) & (df['mpg'] <= 25)]['name'].reset_index(drop=True)

0     toyota corona mark ii
1           plymouth duster
2             ford maverick
3               peugeot 504
4               audi 100 ls
              ...          
84          ford granada gl
85     ford fairmont futura
86           amc concord dl
87    buick century limited
88           ford granada l
Name: name, Length: 89, dtype: object
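
An inclusive range filter like this one can also be written with `Series.between`, which includes both endpoints by default. A small sketch with made-up rows (the car names and values are placeholders):

```python
import pandas as pd

# Hypothetical toy data; names and values are placeholders
toy = pd.DataFrame({
    'name': ['car a', 'car b', 'car c', 'car d'],
    'mpg': [19.9, 20.0, 25.0, 25.1],
})

# between(20, 25) keeps rows where 20 <= mpg <= 25
in_range = toy.loc[toy['mpg'].between(20, 25), 'name'].reset_index(drop=True)
print(in_range.tolist())
```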

### 7. Show the 3 cars with the highest acceleration values.

In [9]:
df.sort_values('acceleration', ascending=False).head(3)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
299,27.2,4,141.0,71.0,3190,24.8,79,europe,peugeot 504
394,44.0,4,97.0,52.0,2130,24.6,82,europe,vw pickup
326,43.4,4,90.0,48.0,2335,23.7,80,europe,vw dasher (diesel)
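
For a top-N query, `DataFrame.nlargest` is a compact alternative to sorting the whole frame and taking the head. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data; names and values are placeholders
toy = pd.DataFrame({
    'name': ['car a', 'car b', 'car c', 'car d'],
    'acceleration': [12.0, 24.8, 15.5, 23.7],
})

# nlargest returns the n rows with the largest values, ordered from largest down
top3 = toy.nlargest(3, 'acceleration')
print(top3)
```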


### 8. Add a new column named kpl that converts mpg to kilometers per liter (1 mpg = 0.42514 kpl)

In [10]:
# 1 mpg = 0.42514 km/l, the factor stated in the exercise
df['kpl'] = df['mpg'] * 0.42514
df[['mpg','kpl']]

Unnamed: 0,mpg,kpl
0,18.0,7.65252
1,15.0,6.37710
2,18.0,7.65252
3,16.0,6.80224
4,17.0,7.22738
...,...,...
393,27.0,11.47878
394,44.0,18.70616
395,32.0,13.60448
396,28.0,11.90392
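
The factor 0.42514 quoted in the heading can be derived from the unit definitions, assuming "mpg" means miles per US gallon (an imperial gallon would give a different factor). A short sketch, where the constant and function names are chosen only for this example:

```python
# Definitions: 1 international mile = 1.609344 km, 1 US gallon = 3.785411784 L
KM_PER_MILE = 1.609344
LITRES_PER_US_GALLON = 3.785411784

# Kilometres per litre corresponding to one mile per US gallon
KPL_PER_MPG = KM_PER_MILE / LITRES_PER_US_GALLON


def mpg_to_kpl(mpg: float) -> float:
    """Convert miles per US gallon to kilometres per litre."""
    return mpg * KPL_PER_MPG


print(round(KPL_PER_MPG, 5))
```

Deriving the constant instead of typing it by hand avoids single-digit slips in the factor.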


### 9. How does the mean "mpg" change from year to year?

In [11]:
df_rata2 = pd.DataFrame(df.groupby('model_year')['mpg'].mean())
df_rata2['difference'] = df_rata2['mpg'].diff()
df_rata2

model_year,mpg,difference
70,17.689655,
71,21.25,3.560345
72,18.714286,-2.535714
73,17.1,-1.614286
74,22.703704,5.603704
75,20.266667,-2.437037
76,21.573529,1.306863
77,23.375,1.801471
78,24.061111,0.686111
79,25.093103,1.031992


### 10. Find the car with the highest fuel economy ("mpg") and show its full details.

In [12]:
df.sort_values(by="mpg", ascending=False).head(1)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name,kpl
322,46.6,4,86.0,65.0,2110,17.9,80,japan,mazda glc,19.811524


In [13]:
df.sort_values(by = 'mpg', ascending = False).iloc[0]

mpg                  46.6
cylinders               4
displacement         86.0
horsepower           65.0
weight               2110
acceleration         17.9
model_year             80
origin              japan
name            mazda glc
kpl             19.811524
Name: 322, dtype: object

### 11. Find the car with the best combined performance, that is, high "horsepower", high "mpg", and low "acceleration". Show its full details.

In [14]:
# Sort by horsepower (descending), then mpg (descending), then acceleration (ascending).
# Later keys only break ties in the earlier ones, so horsepower decides the result.
df.sort_values(by=['horsepower', 'mpg', 'acceleration'], ascending=[False, False, True]).iloc[0]

mpg                           16.0
cylinders                        8
displacement                 400.0
horsepower                   230.0
weight                        4278
acceleration                   9.5
model_year                      73
origin                         usa
name            pontiac grand prix
kpl                        6.80224
Name: 116, dtype: object
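
A multi-key sort is lexicographic, so the first key dominates and the other two only break ties. One possible alternative, which is not the method used in the cell above, is to rescale each criterion to [0, 1] and add them up. The sketch below assumes equal weights for the three criteria and uses a hypothetical `toy` frame; both the weighting and the helper `min_max` are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical toy data; names and values are placeholders
toy = pd.DataFrame({
    'name': ['car a', 'car b', 'car c'],
    'horsepower': [230.0, 180.0, 90.0],
    'mpg': [16.0, 30.0, 40.0],
    'acceleration': [9.5, 10.0, 20.0],
})


def min_max(s: pd.Series) -> pd.Series:
    """Rescale a Series linearly to the [0, 1] range."""
    return (s - s.min()) / (s.max() - s.min())


# Equal weights are an assumption; acceleration is inverted because lower is better
toy['score'] = (
    min_max(toy['horsepower'])
    + min_max(toy['mpg'])
    + (1 - min_max(toy['acceleration']))
)

# Row with the highest combined score
best = toy.loc[toy['score'].idxmax()]
print(best['name'])
```

On this toy data the lexicographic sort would pick `car a` (highest horsepower), while the combined score favors `car b`, which is strong on all three criteria.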

### 12. Check whether the "mpg" DataFrame has any missing values and, if there are any, handle them appropriately (for example, fill them with the mean).

In [15]:
df.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
name            0
kpl             0
dtype: int64

In [16]:
horse_power_fillna = pd.DataFrame(df['horsepower'].fillna(df['horsepower'].mean()))

In [17]:
horse_power_fillna.isnull().sum()

horsepower    0
dtype: int64
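
The cell above stores the filled values in a separate frame, `horse_power_fillna`, so `df['horsepower']` itself still contains the missing entries. If the goal is to clean the column in the working frame, the result can be assigned back. A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
toy = pd.DataFrame({'horsepower': [100.0, np.nan, 140.0, np.nan]})

# mean() skips NaN by default, so this is the mean of the observed values
fill_value = toy['horsepower'].mean()

# fillna returns a new Series, so assign it back to update the column
toy['horsepower'] = toy['horsepower'].fillna(fill_value)

print(toy['horsepower'].isnull().sum())
```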

### 13. Which cars have the highest fuel consumption (lowest "mpg") in each country of origin?

In [18]:
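# Caution: min() is applied to each column separately, so `name` here is the
# alphabetically first name per group, not the car that has the lowest mpg
mpg.groupby('origin')[['name', 'mpg']].min()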

origin,name,mpg
europe,audi 100 ls,16.2
japan,datsun 1200,18.0
usa,amc ambassador brougham,9.0


In [19]:
mpg.loc[mpg.groupby('origin')['mpg'].idxmin()]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
277,16.2,6,163.0,133.0,3410,15.8,78,europe,peugeot 604sl
111,18.0,3,70.0,90.0,2124,13.5,73,japan,maxda rx3
28,9.0,8,304.0,193.0,4732,18.5,70,usa,hi 1200d
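
The two cells above return different car names for the same `mpg` values. The difference comes from how the minimum is taken: a grouped `min()` works column by column, while `idxmin()` returns the index label of one row per group, which `loc` then retrieves in full. A small sketch on hypothetical data makes the contrast visible:

```python
import pandas as pd

# Hypothetical toy data; names and values are placeholders
toy = pd.DataFrame({
    'origin': ['usa', 'usa', 'japan', 'japan'],
    'name': ['alpha', 'zeta', 'beta', 'yota'],
    'mpg': [20.0, 9.0, 18.0, 30.0],
})

# Column-wise minimum: name and mpg are minimised independently of each other
per_column = toy.groupby('origin')[['name', 'mpg']].min()

# Row-wise lookup: idxmin gives the label of the lowest-mpg row in each group
per_row = toy.loc[toy.groupby('origin')['mpg'].idxmin()]

print(per_column)
print(per_row)
```

In `per_column` the `usa` row pairs the name `alpha` with 9.0 mpg, although 9.0 belongs to `zeta`. Only `per_row` keeps name and value from the same car.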
