# **Introduction to Pandas II**

Pandas provides data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. It is often used together with numerical computing libraries such as NumPy and SciPy, analytics libraries such as statsmodels and scikit-learn, and data visualization libraries such as matplotlib. Pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for processing data without explicit loops.

Since becoming open source in 2010, pandas has matured into a fairly large library that is applicable to a broad range of real-world use cases. The developer community has grown to more than 800 distinct contributors, who have helped build the project as they used it to solve their day-to-day data problems.

___

## **1. Inspecting a DataFrame Object**

We will work with the file `unicorn_companies_raw.csv`, so we first need to import the libraries and read the file.

In [238]:
import pandas as pd 
import numpy as np
import datetime as dt 

df = pd.read_csv('unicorn_companies_raw.csv')
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,$8B,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,$7B,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,$2B,"Tiger Global Management, Sequoia Capital China..."
3,Stripe,$95B,2014-01-23,FinTech,San Francisco,United States,North America,2010,$2B,"Khosla Ventures, LowercaseCapital, capitalG"
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,$4B,"Institutional Venture Partners, Sequoia Capita..."


We can change the display options to show more columns later on:

In [239]:
pd.options.display.max_columns=30

Some commonly used methods for inspecting a DataFrame are listed below.

| Method | Description | 
| --- | --- | 
| `head()` | Return the first n rows. |
| `tail()` | Return the last n rows. |
| `sample()` | Return a random sample of items from an axis of object. |
| `info()` | Print a concise summary of a DataFrame. |
| `isna()` | Detect missing values. |
| `duplicated()` | Return boolean Series denoting duplicate rows. |
| `any()` | Return whether any element is True, potentially over an axis. |
| `all()` | Return whether all elements are True, potentially over an axis. |
| `describe()` | Generate descriptive statistics. |

### `Examining dataframes`

Is the DataFrame empty?

In [240]:
df.empty

False

What are its dimensions?

In [241]:
# shows there are 1074 rows and 10 columns
df.shape

(1074, 10)

Which columns do we have?

In [242]:
df.columns

Index(['Company', 'Valuation', 'Date Joined', 'Industry', 'City',
       'Country/Region', 'Continent', 'Year Founded', 'Funding',
       'Select Investors'],
      dtype='object')

What about the index?

In [243]:
df.index

RangeIndex(start=0, stop=1074, step=1)

What does the data look like?

View the first 5 rows with `head()`:

In [244]:
# by default: shows the first 5 rows
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,$8B,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,$7B,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,$2B,"Tiger Global Management, Sequoia Capital China..."
3,Stripe,$95B,2014-01-23,FinTech,San Francisco,United States,North America,2010,$2B,"Khosla Ventures, LowercaseCapital, capitalG"
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,$4B,"Institutional Venture Partners, Sequoia Capita..."


In [245]:
# shows the first 10 rows
df.head(10)

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,$8B,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,$7B,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,$2B,"Tiger Global Management, Sequoia Capital China..."
3,Stripe,$95B,2014-01-23,FinTech,San Francisco,United States,North America,2010,$2B,"Khosla Ventures, LowercaseCapital, capitalG"
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,$4B,"Institutional Venture Partners, Sequoia Capita..."
5,Canva,$40B,2018-01-08,Internet software & services,Surry Hills,Australia,Oceania,2012,$572M,"Sequoia Capital China, Blackbird Ventures, Mat..."
6,Checkout.com,$40B,2019-05-02,Fintech,London,United Kingdom,Europe,2012,$2B,"Tiger Global Management, Insight Partners, DST..."
7,Instacart,$39B,2014-12-30,"Supply chain, logistics, & delivery",San Francisco,United States,North America,2012,$3B,"Khosla Ventures, Kleiner Perkins Caufield & By..."
8,JUUL Labs,$38B,2017-12-20,Consumer & retail,San Francisco,United States,North America,2015,$14B,Tiger Global Management
9,Databricks,$38B,2019-02-05,Data management and analytics,San Francisco,United States,North America,2013,$3B,"Andreessen Horowitz, New Enterprise Associates..."


View the last 5 rows with `tail()`:

In [246]:
# by default: shows the last 5 rows
df.tail()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
1069,Zhaogang,$1B,2017-06-29,E-commerce & direct-to-consumer,Shanghai,China,Asia,2012,$379M,"K2 Ventures, Matrix Partners China, IDG Capital"
1070,Zhuan Zhuan,$1B,2017-04-18,E-commerce & direct-to-consumer,Beijing,China,Asia,2015,$990M,"58.com, Tencent Holdings"
1071,Zihaiguo,$1B,2021-05-06,Consumer & retail,Chongqing,China,Asia,2018,$80M,"Xingwang Investment Management, China Capital ..."
1072,Zopa,$1B,2021-10-19,Fintech,London,United Kingdom,Europe,2005,$792M,"IAG Capital Partners, Augmentum Fintech, North..."
1073,Zwift,$1B,2020-09-16,E-commerce & direct-to-consumer,Long Beach,United States,North America,2014,$620M,"Novator Partners, True, Causeway Media Partners"


In [247]:
# shows the last 3 rows
df.tail(3)

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
1071,Zihaiguo,$1B,2021-05-06,Consumer & retail,Chongqing,China,Asia,2018,$80M,"Xingwang Investment Management, China Capital ..."
1072,Zopa,$1B,2021-10-19,Fintech,London,United Kingdom,Europe,2005,$792M,"IAG Capital Partners, Augmentum Fintech, North..."
1073,Zwift,$1B,2020-09-16,E-commerce & direct-to-consumer,Long Beach,United States,North America,2014,$620M,"Novator Partners, True, Causeway Media Partners"


View random rows with `sample()`:

In [248]:
# by default: shows 1 random row
df.sample()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
935,Human Interest,$1B,2021-08-04,Fintech,San Francisco,United States,North America,2015,$337M,"Wing Venture Capital, Slow Ventures, Uncork Ca..."


In [249]:
# shows 10 random rows
# random_state acts as a random seed, so the same rows are sampled on every run
df.sample(10, random_state=5)

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
917,FlashEx,$1B,2018-08-27,"Supply chain, logistics, & delivery",Beijing,China,Asia,2014,$359M,"Prometheus Capital, Matrix Partners China, JD ..."
846,Jiuxian,$1B,2015-07-30,E-commerce & direct-to-consumer,Beijing,China,Asia,2009,$250M,"Sequoia Capital China, Rich Land Capital, Merr..."
995,Pentera,$1B,2022-01-11,Cybersecurity,Petah Tikva,Israel,Asia,2015,$190M,"AWZ Ventures, Blackstone, Insight Partners"
562,Formlabs,$2B,2018-08-01,Hardware,Somerville,United States,North America,2011,$251M,"Pitango Venture Capital, DFJ Growth Fund, Foun..."
477,PAX,$2B,2018-10-22,Consumer & retail,San Francisco,United States,North America,2007,$542M,"Tao Capital Partners, Global Asset Capital, Ti..."
895,CommerceIQ,$1B,2022-03-21,Artificial Intelligence,Palo Alto,United States,North America,2012,$196M,"Trinity Ventures, Madrona Venture Group, Shast..."
60,Dunamu,$9B,2021-07-22,Fintech,Seoul,South Korea,Asia,2012,$71M,"Qualcomm Ventures, Woori Investment, Hanwha In..."
682,Gymshark,$1B,2020-08-14,E-commerce & direct-to-consumer,Solihull,United Kingdom,Europe,2012,$262M,General Atlantic
352,wefox,$3B,2019-03-05,Fintech,Berlin,Germany,Europe,2014,$919M,"Salesforce Ventures, Seedcamp, OMERS Ventures"
458,Trader Interactive,$2B,2021-05-12,Other,Norfolk,United States,North America,2017,$624M,Carsales
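
The effect of `random_state` can be checked on a small scale. The sketch below uses a made-up toy DataFrame (not the unicorn dataset) and draws the same sample twice with the same seed.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({'Company': list('ABCDEFGH'), 'Valuation Number': range(8)})

# Two draws with the same seed
first_draw = toy.sample(3, random_state=5)
second_draw = toy.sample(3, random_state=5)

# Same seed, so both draws contain the same rows in the same order
same_rows = first_draw.equals(second_draw)
print(same_rows)
```

Without `random_state`, each call may return different rows, which makes results harder to reproduce.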


What about the data types? And are there any null values?

In [250]:
# show the data type of each column
df.dtypes

Company             object
Valuation           object
Date Joined         object
Industry            object
City                object
Country/Region      object
Continent           object
Year Founded         int64
Funding             object
Select Investors    object
dtype: object

In [251]:
# print a concise summary of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Company           1074 non-null   object
 1   Valuation         1074 non-null   object
 2   Date Joined       1074 non-null   object
 3   Industry          1074 non-null   object
 4   City              1057 non-null   object
 5   Country/Region    1074 non-null   object
 6   Continent         1074 non-null   object
 7   Year Founded      1074 non-null   int64 
 8   Funding           1074 non-null   object
 9   Select Investors  1074 non-null   object
dtypes: int64(1), object(9)
memory usage: 84.0+ KB


In [252]:
# count the missing values in each column
df.isna().sum()

Company              0
Valuation            0
Date Joined          0
Industry             0
City                17
Country/Region       0
Continent            0
Year Founded         0
Funding              0
Select Investors     0
dtype: int64

Next, we will display the rows that contain missing values.

In [253]:
# any: True if at least one value is True
# all: True only if every value is True
# both default to axis=0 (one result per column); axis=1 gives one result per row
# flag the rows that contain a missing value
condition = df.isna().any(axis=1)

In [254]:
# show the rows that have at least 1 missing value
df_missing_rows = df[condition]
df_missing_rows

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
12,FTX,$32B,2021-07-20,Fintech,,Bahamas,North America,2018,$2B,"Sequoia Capital, Thoma Bravo, Softbank"
169,HyalRoute,$4B,2020-05-26,Mobile & telecommunications,,Singapore,Asia,2015,$263M,Kuang-Chi
241,Moglix,$3B,2021-05-17,E-commerce & direct-to-consumer,,Singapore,Asia,2015,$471M,"Jungle Ventures, Accel, Venture Highway"
250,Trax,$3B,2019-07-22,Artificial intelligence,,Singapore,Asia,2010,$1B,"Hopu Investment Management, Boyu Capital, DC T..."
324,Amber Group,$3B,2021-06-21,Fintech,,Hong Kong,Asia,2015,$328M,"Tiger Global Management, Tiger Brokers, DCM Ve..."
381,Ninja Van,$2B,2021-09-27,"Supply chain, logistics, & delivery",,Singapore,Asia,2014,$975M,"B Capital Group, Monk's Hill Ventures, Dynamic..."
511,ZocDoc,$2B,2015-08-20,Health,,United States,North America,2007,$374M,Founders Fund
540,Advance Intelligence Group,$2B,2021-09-23,Artificial intelligence,,Singapore,Asia,2016,$536M,"Vision Plus Capital, GSR Ventures, ZhenFund"
810,Carousell,$1B,2021-09-15,E-commerce & direct-to-consumer,,Singapore,Asia,2012,$288M,"500 Global, Rakuten Ventures, Golden Gate Vent..."
847,Matrixport,$1B,2021-06-01,Fintech,,Singapore,Asia,2019,$100M,"Dragonfly Captial, Qiming Venture Partners, DS..."
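
To make the difference between `any()` and `all()` and between the two axes concrete, here is a minimal sketch on a made-up three-row DataFrame. The column values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data with a single missing City value (values are made up)
toy = pd.DataFrame({
    'Company': ['A', 'B', 'C'],
    'City': ['Jakarta', np.nan, 'Bandung'],
    'Country': ['Indonesia', 'Indonesia', 'Indonesia'],
})

missing = toy.isna()

cols_with_missing = missing.any()        # axis=0: one result per column
rows_with_missing = missing.any(axis=1)  # axis=1: one result per row
rows_all_missing = missing.all(axis=1)   # True only if every column is missing

# Boolean indexing keeps only the rows flagged True
print(toy[rows_with_missing])
```

Only the `City` column and only the second row are flagged, and no row is missing in every column.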


Are there any duplicate rows?

In [255]:
# count the duplicate rows
df.duplicated().sum()

0

The check above shows that no row is an exact duplicate of another across all columns. However, let's look more closely by checking for duplicates based on the `Company` column alone.

In [256]:
df[df.duplicated(subset=['Company'], keep=False)]

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
385,BrewDog,$2B,2017-04-10,Consumer & retail,Aberdeen,United Kingdom,Europe,2007,$233M,"TSG Consumer Partners, Crowdcube"
386,BrewDog,$2B,2017-04-10,Consumer & retail,Aberdeen,UnitedKingdom,Europe,2007,$233M,TSG Consumer Partners
510,ZocDoc,$2B,2015-08-20,Health,New York,United States,North America,2007,$374M,"Founders Fund, Khosla Ventures, Goldman Sachs"
511,ZocDoc,$2B,2015-08-20,Health,,United States,North America,2007,$374M,Founders Fund
1031,SoundHound,$1B,2018-05-03,Artificial intelligence,Santa Clara,United States,North America,2005,$215M,"Tencent Holdings, Walden Venture Capital, Glob..."
1032,SoundHound,$1B,2018-05-03,Other,Santa Clara,United States,North America,2005,$215M,Tencent Holdings
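
The `subset` and `keep` arguments change what counts as a duplicate. The following sketch, on made-up data, compares the default call with the subset-based ones.

```python
import pandas as pd

# Toy data: company 'A' appears twice with different investors,
# company 'C' appears twice with identical rows (values are made up)
toy = pd.DataFrame({
    'Company': ['A', 'A', 'B', 'C', 'C'],
    'Investors': ['x, y', 'x', 'z', 'w', 'w'],
})

# Rows identical in every column; the first occurrence is not flagged
exact_dupes = toy.duplicated().sum()

# keep=False flags every row whose Company appears more than once
same_company_all = toy.duplicated(subset=['Company'], keep=False)

# The default keep='first' flags only the later occurrences
same_company_later = toy.duplicated(subset=['Company'])

print(toy[same_company_all])
```

Using `keep=False` is convenient for inspection because it shows all conflicting rows side by side before deciding which one to keep.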


### `Describing and Summarizing`

Get summary statistics for the numeric columns:

In [257]:
df.describe()

# count: number of non-NaN values
# mean: average
# std: standard deviation
# min: minimum value
# 25%: first quartile
# 50%: second quartile, i.e. the median
# 75%: third quartile
# max: maximum value

Unnamed: 0,Year Founded
count,1074.0
mean,2012.870577
std,5.705494
min,1919.0
25%,2011.0
50%,2014.0
75%,2016.0
max,2021.0


Summary with the 5th and 95th percentiles specifically:

In [258]:
df.describe(percentiles=[0.05,0.95])

Unnamed: 0,Year Founded
count,1074.0
mean,2012.870577
std,5.705494
min,1919.0
5%,2003.0
50%,2014.0
95%,2019.0
max,2021.0
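
The percentile rows reported by `describe()` can be cross-checked with `quantile()`. This is a small sketch on a made-up Series of the integers 1 to 100, not on the dataset itself.

```python
import pandas as pd

# Toy data: the integers 1..100
years = pd.Series(range(1, 101))

# Ask describe() for the 5th and 95th percentiles
summary = years.describe(percentiles=[0.05, 0.95])

# Compute the same percentiles directly
q = years.quantile([0.05, 0.95])

# The values should agree with the '5%' and '95%' rows of describe()
print(summary[['5%', '50%', '95%']])
print(q)
```

Note that `describe()` reports the median (`50%`) even when it is not listed in `percentiles`, which matches the output above.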


Summary statistics for the categorical columns:

In [259]:
df.describe(include='object')

# count: number of non-NaN values
# unique: number of unique values
# top: mode (the most frequent value)
# freq: how often the mode occurs

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Funding,Select Investors
count,1074,1074,1074,1074,1057,1074,1074,1074,1074
unique,1071,30,638,18,256,47,6,538,1059
top,BrewDog,$1B,2021-07-13,Fintech,San Francisco,United States,North America,$1B,Sequoia Capital
freq,2,472,9,204,149,561,588,59,3


There are also methods for specific statistics. Here are some examples.

| Method | Description | Data types |
| --- | --- | --- |
| `count()` | The number of non-null observations | Any |
| `value_counts()` | The number of occurrences of each unique value | Any |
| `unique()` | Show unique values | Any |
| `nunique()` | The number of unique values | Any |
| `sum()` | The total of the values | Numerical or Boolean |
| `mean()` | The average of the values | Numerical or Boolean |
| `median()` | The median of the values | Numerical |
| `min()` | The minimum of the values | Numerical |
| `idxmin()` | The index where the minimum value occurs | Numerical |
| `max()` | The maximum of the values | Numerical |
| `idxmax()` | The index where the maximum value occurs | Numerical |
| `abs()` | The absolute values of the data | Numerical |
| `std()` | The standard deviation | Numerical |
| `var()` | The variance | Numerical |
| `cov()` | The covariance between two `Series`, or a covariance matrix for all column combinations in a `DataFrame` | Numerical |
| `corr()` | The correlation between two `Series`, or a correlation matrix for all column combinations in a `DataFrame` | Numerical |
| `quantile()` | Calculates a specific quantile | Numerical |
| `cumsum()` | The cumulative sum | Numerical or Boolean |
| `cummin()` | The cumulative minimum | Numerical |
| `cummax()` | The cumulative maximum | Numerical |

For example, what are the unique values in the `Continent` column? How many are there?

In [260]:
# get the unique values --> unique()
df['Continent'].unique()

array(['Asia', 'North America', 'Europe', 'Oceania', 'South America',
       'Africa'], dtype=object)

In [261]:
# count the unique values
len(df['Continent'].unique())

6

In [262]:
# count the unique values directly with nunique()
df['Continent'].nunique()

6

What is the median `Year Founded` of companies whose `Continent` is Europe?

In [263]:
df[df['Continent']=='Europe']['Year Founded'].median()

2013.5
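
The filter-then-aggregate pattern above can also be written with a named boolean mask and `.loc`, which selects rows and a column in one step. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data (values are made up)
toy = pd.DataFrame({
    'Continent': ['Europe', 'Asia', 'Europe', 'Europe'],
    'Year Founded': [2005, 2012, 2010, 2014],
})

# Boolean mask: True for the rows located in Europe
is_europe = toy['Continent'] == 'Europe'

# Chained indexing, as in the cell above
median_chained = toy[is_europe]['Year Founded'].median()

# Equivalent selection of rows and column in a single .loc call
median_loc = toy.loc[is_europe, 'Year Founded'].median()

print(median_chained, median_loc)
```

Both forms give the same result for reading data; `.loc` is generally the safer habit once values also need to be assigned.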

Which industries are covered in the dataset?

In [264]:
df['Industry'].unique()

array(['Artificial intelligence', 'Other',
       'E-commerce & direct-to-consumer', 'FinTech', 'Fintech',
       'Internet software & services',
       'Supply chain, logistics, & delivery', 'Consumer & retail',
       'Data management and analytics', 'Edtech', 'Health', 'Hardware',
       'Auto & transportation', 'Travel', 'Cybersecurity',
       'Mobile & telecommunications', 'Data management & analytics',
       'Artificial Intelligence'], dtype=object)

How many companies are there in each industry?

In [265]:
df['Industry'].value_counts()

Fintech                                204
Internet software & services           204
E-commerce & direct-to-consumer        111
Health                                  75
Artificial intelligence                 73
Other                                   59
Supply chain, logistics, & delivery     57
Cybersecurity                           50
Mobile & telecommunications             38
Data management & analytics             35
Hardware                                34
Auto & transportation                   31
Edtech                                  28
Consumer & retail                       26
FinTech                                 19
Travel                                  14
Artificial Intelligence                 10
Data management and analytics            6
Name: Industry, dtype: int64

In [266]:
# normalize: computes proportions so that they sum to 1 (100%)
# round: rounds the result
(df['Industry'].value_counts(normalize=True)*100).round(2)

Fintech                                18.99
Internet software & services           18.99
E-commerce & direct-to-consumer        10.34
Health                                  6.98
Artificial intelligence                 6.80
Other                                   5.49
Supply chain, logistics, & delivery     5.31
Cybersecurity                           4.66
Mobile & telecommunications             3.54
Data management & analytics             3.26
Hardware                                3.17
Auto & transportation                   2.89
Edtech                                  2.61
Consumer & retail                       2.42
FinTech                                 1.77
Travel                                  1.30
Artificial Intelligence                 0.93
Data management and analytics           0.56
Name: Industry, dtype: float64
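
One detail worth knowing is that `value_counts()` skips missing values unless told otherwise. The sketch below uses a made-up Series with one missing entry to show the effect of `normalize` and `dropna`.

```python
import pandas as pd

# Toy data with one missing value (values are made up)
industry = pd.Series(['Fintech', 'Fintech', 'Health', None])

# Default: missing values are excluded from the counts
counts = industry.value_counts()

# Proportions are computed over the non-missing values only
shares = industry.value_counts(normalize=True)

# dropna=False adds a separate entry for the missing values
counts_with_nan = industry.value_counts(dropna=False)

print(counts)
print(counts_with_nan)
```

For a column such as `City`, which contains missing values, `dropna=False` makes those rows visible in the tally.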

How many distinct `Country/Region` values are in the dataset?

In [267]:
df['Country/Region'].nunique()

47

In what year was the oldest company to become a unicorn founded?

In [268]:
df['Year Founded'].min()

1919

## **2. Data Cleaning**

Some of the methods we will use to clean a dataset are listed below.

| Method | Description | 
| --- | --- | 
| `replace()` | Replace values. | 
| `drop()` | Drop specified labels from rows or columns. |
| `rename()` | Rename columns or index labels. |
| `assign()` | Assign new columns to a DataFrame. |
| `apply()` | Apply a function along an axis of the DataFrame. |
| `dropna()` | Remove missing values. |
| `fillna()` | Fill NA/NaN values using the specified method. |
| `drop_duplicates()` | Return DataFrame with duplicate rows removed. |
| `reset_index()` | Reset the index, or a level of it. |

### `Type correction`

Is anything off with the data types? `Date Joined` should be stored as a datetime. Let's fix that.

In [269]:
# pd.to_datetime: converts the data type to datetime
df['Date Joined'] = pd.to_datetime(df['Date Joined'])

In [270]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Company           1074 non-null   object        
 1   Valuation         1074 non-null   object        
 2   Date Joined       1074 non-null   datetime64[ns]
 3   Industry          1074 non-null   object        
 4   City              1057 non-null   object        
 5   Country/Region    1074 non-null   object        
 6   Continent         1074 non-null   object        
 7   Year Founded      1074 non-null   int64         
 8   Funding           1074 non-null   object        
 9   Select Investors  1074 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 84.0+ KB
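
The conversion above works because every `Date Joined` value is a valid date. As a hedged sketch of what could be done when that is not the case, the made-up Series below contains one unparseable entry, and `errors='coerce'` turns it into `NaT` instead of raising an error.

```python
import pandas as pd

# Toy data: the last entry is not a valid date (values are made up)
raw = pd.Series(['2017-04-07', '2012-12-01', 'unknown'])

# errors='coerce' replaces unparseable values with NaT
parsed = pd.to_datetime(raw, errors='coerce')

# NaT is treated as a missing value, so isna() can locate the bad entries
bad_rows = parsed.isna()

print(parsed)
print(raw[bad_rows])
```

Inspecting `raw[bad_rows]` before cleaning helps decide whether the problem entries should be fixed by hand or dropped.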


### `Typo Correction`

In the `Industry` column, some values are misspelled. The correct list of industries is given below.

In [271]:
industry_list = ['Artificial intelligence', 'Other', 'E-commerce & direct-to-consumer', 'Fintech',
                 'Internet software & services', 'Supply chain, logistics, & delivery', 'Consumer & retail',
                 'Data management & analytics', 'Edtech', 'Health', 'Hardware', 'Auto & transportation',
                 'Travel', 'Cybersecurity', 'Mobile & telecommunications']

We will display the values in the `Industry` column that are not in `industry_list`.

In [272]:
for industry in df['Industry'].unique():
    if industry not in industry_list:
        print(industry)

FinTech
Data management and analytics
Artificial Intelligence


In [273]:
set(df['Industry'].unique()).difference(set(industry_list))

{'Artificial Intelligence', 'Data management and analytics', 'FinTech'}

Let's fix those incorrect values in the `Industry` column.

In [274]:
# build a dictionary of {'old_value': 'new_value'}
replacement_dict = {
    'Artificial Intelligence' : 'Artificial intelligence',
    'Data management and analytics': 'Data management & analytics',
    'FinTech': 'Fintech'
}

df['Industry'] = df['Industry'].replace(replacement_dict)

In [275]:
# values can also be replaced one at a time like this (the result is not assigned back here)
df['Industry'].replace('FinTech', 'Fintech')

0               Artificial intelligence
1                                 Other
2       E-commerce & direct-to-consumer
3                               Fintech
4                               Fintech
                     ...               
1069    E-commerce & direct-to-consumer
1070    E-commerce & direct-to-consumer
1071                  Consumer & retail
1072                            Fintech
1073    E-commerce & direct-to-consumer
Name: Industry, Length: 1074, dtype: object

The number of invalid values in the `Industry` column after the fix is:

In [276]:
len(set(df['Industry'].unique()).difference(set(industry_list)))

0
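
Besides a loop or a set difference, the same check can be done in a vectorized way with `isin()`. The sketch below uses a made-up list of valid labels and a made-up Series; the names `valid` and `invalid_mask` are illustrative only.

```python
import pandas as pd

# Hypothetical list of valid labels and toy data with one misspelling
valid = ['Fintech', 'Health']
industry = pd.Series(['Fintech', 'FinTech', 'Health', 'Fintech'])

# True for the rows whose value is NOT in the valid list
invalid_mask = ~industry.isin(valid)
invalid_values = industry[invalid_mask].unique().tolist()

# Fix the misspelling, then confirm that every value is now valid
cleaned = industry.replace({'FinTech': 'Fintech'})
all_valid = cleaned.isin(valid).all()

print(invalid_values, all_valid)
```

The mask has one entry per row, so it can also be used to display the affected rows before replacing anything.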

We will change 'UnitedKingdom' to 'United Kingdom' in the `Country/Region` column.

In [277]:
df['Country/Region'] = df['Country/Region'].replace('UnitedKingdom', 'United Kingdom')

### `Drop unnecessary features`

Let's start by dropping a column that we will not use in this analysis, namely the `Funding` column.

In [278]:
# to make the change permanent, assign the result back
# or use inplace=True
df.drop(columns=['Funding'], inplace=True)

In [279]:
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Select Investors
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China..."
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG"
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita..."


### `Fixing feature name`


Next, rename the `Country/Region` column to `Country`:

In [280]:
df.rename(columns={'Country/Region':'Country'}, inplace=True)

In [281]:
df.columns

Index(['Company', 'Valuation', 'Date Joined', 'Industry', 'City', 'Country',
       'Continent', 'Year Founded', 'Select Investors'],
      dtype='object')
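
As an alternative to `inplace=True`, both `drop()` and `rename()` return a new DataFrame, so the two steps can be chained and assigned once. A minimal sketch on a made-up one-row DataFrame:

```python
import pandas as pd

# Toy data (values are made up)
toy = pd.DataFrame({
    'Country/Region': ['Sweden'],
    'Funding': ['$4B'],
    'Company': ['Klarna'],
})

# Each method returns a new DataFrame, so the calls can be chained
trimmed = (
    toy.drop(columns=['Funding'])
       .rename(columns={'Country/Region': 'Country'})
)

print(list(trimmed.columns))
print(list(toy.columns))  # the original is left unchanged
```

Keeping the original untouched can be handy in a notebook, because a cell can be re-run without failing on a column that has already been dropped.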

### `Add new feature`

Let's create the following new columns:

1. Year Joined
2. Years_To_Unicorn
3. Valuation Number (in B)
4. Valuation Class
5. Number Investors

**Year Joined**

The `Year Joined` column holds the year extracted from the `Date Joined` column.

In [282]:
df['Year Joined'] = df['Date Joined'].dt.year
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S...",2017
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen...",2012
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China...",2018
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG",2014
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita...",2011


**Years_To_Unicorn**

The `Years_To_Unicorn` column is the number of years from when a company was founded until it reached unicorn status.

In [283]:
df['Years_To_Unicorn'] = df['Year Joined'] - df['Year Founded']
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S...",2017,5
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen...",2012,10
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China...",2018,10
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG",2014,4
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita...",2011,6


In [284]:
# alternatively, columns can be added with assign
# syntax
# df.assign(
#   new_column1=function1,
#   new_column2=function2
# )

df = df.assign(
    Years_To_Unicorn=lambda x: x['Year Joined'] - x['Year Founded']
)
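
Two details of `assign()` are easy to miss: later columns in the same call can refer to earlier ones, and a column name containing a space has to be passed through dictionary unpacking because it is not a valid keyword. The sketch below illustrates both on made-up data.

```python
import pandas as pd

# Toy data (values are made up)
toy = pd.DataFrame({
    'Date Joined': pd.to_datetime(['2017-04-07', '2011-12-12']),
    'Year Founded': [2012, 2005],
})

result = toy.assign(
    # 'Year Joined' contains a space, so it is passed via ** unpacking
    **{'Year Joined': lambda x: x['Date Joined'].dt.year},
    # This column uses 'Year Joined', which was created just above
    Years_To_Unicorn=lambda x: x['Year Joined'] - x['Year Founded'],
)

print(result)
```

`assign()` returns a new DataFrame, so `toy` itself still has only its two original columns.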

Let's display summary statistics for the `Years_To_Unicorn` column.

In [285]:
df['Years_To_Unicorn'].describe()

count    1074.000000
mean        7.013035
std         5.331842
min        -3.000000
25%         4.000000
50%         6.000000
75%         9.000000
max        98.000000
Name: Years_To_Unicorn, dtype: float64

We can see that there is a row with a negative `Years_To_Unicorn` value.

In [286]:
df[df['Years_To_Unicorn']<0]

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn
527,InVision,$2B,2017-11-01,Internet software & services,New York,United States,North America,2020,"FirstMark Capital, Tiger Global Management, IC...",2017,-3


An Internet search shows that the company was founded in 2011. We replace the company's `Year Founded` value with 2011.

In [287]:
df.loc[527, 'Year Founded'] = 2011
df.loc[[527]]

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn
527,InVision,$2B,2017-11-01,Internet software & services,New York,United States,North America,2011,"FirstMark Capital, Tiger Global Management, IC...",2017,-3


We recompute the `Years_To_Unicorn` column and check that it no longer contains negative values.

In [288]:
df['Years_To_Unicorn'] = df['Year Joined'] - df['Year Founded']
df['Years_To_Unicorn'].describe()

count    1074.000000
mean        7.021415
std         5.323155
min         0.000000
25%         4.000000
50%         6.000000
75%         9.000000
max        98.000000
Name: Years_To_Unicorn, dtype: float64

**Valuation Number (in B)**

The `Valuation Number (in B)` column holds the integer value parsed from the `Valuation` column.

In [289]:
def str_to_num(valuation):
    valuation = valuation.strip('$B')
    valuation = int(valuation)
    return valuation

In [290]:
# example of apply with a regular function
df['Valuation Number (in B)'] = df['Valuation'].apply(str_to_num)
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B)
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S...",2017,5,180
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen...",2012,10,100
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China...",2018,10,100
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG",2014,4,95
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita...",2011,6,46


In [291]:
# example of apply with a lambda function
df['Valuation'].apply(lambda x: int(x[1:-1]))

0       180
1       100
2       100
3        95
4        46
       ... 
1069      1
1070      1
1071      1
1072      1
1073      1
Name: Valuation, Length: 1074, dtype: int64
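
A third option, shown here as a sketch on a made-up Series, is to use the vectorized `.str` accessor instead of `apply()`. It assumes, like the functions above, that every value has the form `$<number>B`.

```python
import pandas as pd

# Toy data in the same '$<number>B' format (values are made up)
valuation = pd.Series(['$180B', '$100B', '$1B'])

# Strip the leading '$' and trailing 'B', then convert to integers
valuation_number = valuation.str.strip('$B').astype(int)

print(valuation_number.tolist())
```

The `.str` methods operate on the whole column at once, which keeps the transformation in a single readable expression.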

**Valuation Class**

The `Valuation Class` column is `Low` if the company is in the bottom 50% by valuation and `High` if it is in the top 50%.

In [292]:
med = df['Valuation Number (in B)'].median()
df['Valuation Class'] = df['Valuation Number (in B)'].apply(lambda x: 'High' if x > med else 'Low')
df.head()

# note: this approach is risky if rows past the median position still share the median value (2); they are labeled 'Low'

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S...",2017,5,180,High
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen...",2012,10,100,High
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China...",2018,10,100,High
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG",2014,4,95,High
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita...",2011,6,46,High


In [293]:
# split into quantile-based groups (equal in size when there are no ties at the bin edges)
# syntax: pd.qcut(data, number_of_groups, labels_matching_number_of_groups)
df['Valuation Class'] = pd.qcut(df['Valuation Number (in B)'], 2, ['Low', 'High'])
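
When many rows share the median value, neither approach yields two equally sized groups. The sketch below shows this on a made-up Series in which the median value 2 appears three times.

```python
import pandas as pd

# Toy data: the median is 2 and the value 2 is repeated (values are made up)
valuation = pd.Series([1, 1, 2, 2, 2, 5])

# Approach 1: compare every value with the median
med = valuation.median()
by_median = valuation.apply(lambda x: 'High' if x > med else 'Low')

# Approach 2: quantile-based binning into two groups
by_qcut = pd.qcut(valuation, 2, labels=['Low', 'High'])

# Values equal to the median fall into 'Low' in both approaches
print(by_median.value_counts())
print(by_qcut.value_counts())
```

Checking `value_counts()` on the resulting column is a quick way to see how balanced the two classes actually are.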

**Number Investors**

The `Number Investors` column holds the number of investors listed in the `Select Investors` column.

In [294]:
df['Number Investors'] = df['Select Investors'].apply(lambda x: len(x.split(', ')))
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class,Number Investors
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S...",2017,5,180,High,4
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen...",2012,10,100,High,3
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China...",2018,10,100,High,3
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG",2014,4,95,High,3
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita...",2011,6,46,High,3
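
As an alternative to `apply`, the count can be computed with vectorized string methods. This is a minimal sketch on three investor strings that appear in the outputs of this notebook, assuming names are always separated by a comma followed by a space:

```python
import pandas as pd

investors = pd.Series([
    'Sequoia Capital, Thoma Bravo, Softbank',
    'Kuang-Chi',
    'IFC, Ajinomoto',
])

# Split each string into a list of names, then take the length of each list
number_investors = investors.str.split(', ').str.len()
print(number_investors.tolist())  # [3, 1, 2]
```

Unlike the lambda version, the `.str` accessor propagates missing values as `NaN` instead of raising an error.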


### `Handling Missing Values`

There are several ways to handle missing values, which is an essential part of EDA. The two main approaches are removing them or imputing them, that is, filling in other values in their place. The right choice depends on the business problem and on the value the resulting solution adds.

Here, we will try both.

**Dropping missing values**

To compare the effect of the different actions, first store the size of the original data in a variable. Create a variable named `count_total`, an integer holding the total number of values in `df`. For example, if a dataframe has 5 rows and 2 columns, the number is 10.

In [295]:
# total number of values: rows multiplied by columns
count_total = df.size
count_total

15036

Now drop every row that contains a missing value and store the number of remaining values in a variable named `count_dropna_rows`.

In [296]:
# drops the entire row even if only one column has a missing value
count_dropna_rows = df.dropna(axis=0).size
count_dropna_rows

14798

Now drop every column that contains a missing value and store the number of remaining values in a variable named `count_dropna_columns`.

In [297]:
count_dropna_columns = df.dropna(axis=1).size
count_dropna_columns

13962

Next, print the percentage of values removed by each method and compare them.

In [298]:
print(f'Percentage removed by dropping rows: {(count_total - count_dropna_rows)/count_total * 100:.2f}%')
print(f'Percentage removed by dropping columns: {(count_total - count_dropna_columns)/count_total * 100:.2f}%')

Percentage removed by dropping rows: 1.58%
Percentage removed by dropping columns: 7.14%


Dropping rows is the more effective method here, because it discards less data than dropping columns.
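
The comparison can be reproduced on a tiny frame to see what each call actually removes. This is a sketch with made-up data and hypothetical variable names (prefixed `toy_` so they do not overwrite the notebook's variables); `isna().sum()` is added here to show which column holds the missing value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Company': ['A', 'B', 'C', 'D'],
    'City': ['Berlin', np.nan, 'Tokyo', 'Paris'],
    'Valuation': [1, 2, 3, 4],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()

toy_total = toy.size                          # 4 rows * 3 columns = 12
toy_dropna_rows = toy.dropna(axis=0).size     # 3 rows * 3 columns = 9
toy_dropna_columns = toy.dropna(axis=1).size  # 4 rows * 2 columns = 8

print(missing_per_column['City'], toy_total, toy_dropna_rows, toy_dropna_columns)
```

In this toy case a single missing cell costs one row (3 values) or one whole column (4 values), which mirrors the trade-off measured above.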

**Filling missing values**

Now we will practice the second approach: imputation. We can fill missing values with the DataFrame method [`fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna). For example, we will fill the missing values in the `City` column with its mode.

In [304]:
modus_city = df['City'].mode()[0]
modus_city

'San Francisco'

In [300]:
# impute with modus_city
df_fillna = df.copy()
df_fillna['City'] = df['City'].fillna(modus_city)

In [308]:
df_fillna.loc[df_missing_rows.index]

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class,Number Investors
12,FTX,$32B,2021-07-20,Fintech,San Francisco,Bahamas,North America,2018,"Sequoia Capital, Thoma Bravo, Softbank",2021,3,32,High,3
169,HyalRoute,$4B,2020-05-26,Mobile & telecommunications,San Francisco,Singapore,Asia,2015,Kuang-Chi,2020,5,4,High,1
241,Moglix,$3B,2021-05-17,E-commerce & direct-to-consumer,San Francisco,Singapore,Asia,2015,"Jungle Ventures, Accel, Venture Highway",2021,6,3,High,3
250,Trax,$3B,2019-07-22,Artificial intelligence,San Francisco,Singapore,Asia,2010,"Hopu Investment Management, Boyu Capital, DC T...",2019,9,3,High,3
324,Amber Group,$3B,2021-06-21,Fintech,San Francisco,Hong Kong,Asia,2015,"Tiger Global Management, Tiger Brokers, DCM Ve...",2021,6,3,High,3
381,Ninja Van,$2B,2021-09-27,"Supply chain, logistics, & delivery",San Francisco,Singapore,Asia,2014,"B Capital Group, Monk's Hill Ventures, Dynamic...",2021,7,2,Low,3
511,ZocDoc,$2B,2015-08-20,Health,San Francisco,United States,North America,2007,Founders Fund,2015,8,2,Low,1
540,Advance Intelligence Group,$2B,2021-09-23,Artificial intelligence,San Francisco,Singapore,Asia,2016,"Vision Plus Capital, GSR Ventures, ZhenFund",2021,5,2,Low,3
810,Carousell,$1B,2021-09-15,E-commerce & direct-to-consumer,San Francisco,Singapore,Asia,2012,"500 Global, Rakuten Ventures, Golden Gate Vent...",2021,9,1,Low,3
847,Matrixport,$1B,2021-06-01,Fintech,San Francisco,Singapore,Asia,2019,"Dragonfly Captial, Qiming Venture Partners, DS...",2021,2,1,Low,3


The imputed result does not make sense: there is no city named San Francisco in the Bahamas, Singapore, or Hong Kong.

Another option is to fill the gaps with a fixed placeholder value, such as 'Unknown'. However, this adds no information to the dataset and can make missing values harder to find later.
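
A further possibility, not used in this notebook, is to impute within groups so that a missing city is only filled with the most common city of the same country. The sketch below runs on made-up data; the helper `fill_with_group_mode` is a hypothetical name introduced for illustration. Groups in which no city is known at all are left missing.

```python
import numpy as np
import pandas as pd

companies = pd.DataFrame({
    'Company': ['A', 'B', 'C', 'D', 'E'],
    'City': ['San Francisco', 'San Francisco', np.nan, 'Berlin', np.nan],
    'Country': ['United States', 'United States', 'United States',
                'Germany', 'Singapore'],
})

def fill_with_group_mode(cities):
    """Fill missing cities with the group's mode, if the group has one."""
    modes = cities.mode()
    if modes.empty:
        # No known city in this group: leave the values missing
        return cities
    return cities.fillna(modes.iloc[0])

companies['City'] = (
    companies.groupby('Country')['City'].transform(fill_with_group_mode)
)
print(companies)
```

Here company `C` receives `San Francisco` because other United States rows support it, while company `E` stays missing because no Singapore row has a known city.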

For now, we decide to drop the rows that contain missing values.

In [310]:
# drop the rows that contain missing values
df = df.dropna()

### `Handling Duplicates`

Next, we will handle duplicate data. Every dataset is unique, and we cannot treat every dataset the same way. When deciding whether or not to remove duplicate values, we have to think carefully about the dataset itself and about the goal we want to achieve.

1. Deciding to drop duplicates

    We should remove duplicate values if they are clearly errors or if they would misrepresent the remaining unique values in the dataset.

    For example, we can be fairly confident that a data professional would (in most cases) remove duplicate values from a dataset of home addresses and house prices. Counting the same house twice would (in most cases) distort conclusions drawn from the dataset as a whole, such as the average house price, the total house price, or even the total number of houses. In such a case, a data professional would almost certainly remove the duplicates so that the remaining data is represented fairly during analysis and visualization.

2. Deciding NOT to drop duplicates

    We should keep duplicate data in our dataset if the duplicate values are clearly not errors and must be taken into account when representing the dataset as a whole.

    For example, a dataset recording the number of throws and the distances achieved by an Olympic shot putter in practice will most likely contain some duplicate distances. Given the number of attempts and the limits on how far a person can throw a weighted ball, duplicate values are bound to occur, especially if distances are recorded to only 1 or 2 decimal places. In such a case, a data professional would almost certainly keep all of the data so that it is represented fairly as a whole during analysis and visualization.

Here, we will handle duplicate data by removing it. The shape of the data before removing duplicates is as follows:

In [311]:
df.shape

(1057, 14)

For each duplicated record, we keep the row of the first occurrence and drop the rows of any later occurrences.

In [314]:
# drop rows that share the same value in the Company column
df = df.drop_duplicates(subset='Company')

The shape of the data after removing duplicates is as follows:

In [315]:
df.shape

(1055, 14)
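
Before dropping, it can be useful to look at the rows that are flagged as duplicates, since two rows with the same company name are not necessarily the same company. The following is a minimal sketch on made-up data (the company names and valuations are hypothetical) using `duplicated()` with `keep=False`:

```python
import pandas as pd

toy = pd.DataFrame({
    'Company': ['Acme', 'Acme', 'Globex'],
    'Valuation Number (in B)': [11, 8, 95],
})

# keep=False flags every member of a duplicated group, not only the repeats
flagged = toy[toy.duplicated(subset='Company', keep=False)]

# The default keep='first' retains the first occurrence of each company
deduplicated = toy.drop_duplicates(subset='Company')

print(flagged)
print(deduplicated.shape)  # (2, 2)
```

In this toy frame the two `Acme` rows differ in valuation, which is the kind of detail worth checking before deciding that one of them is redundant.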

After removing duplicates, we sometimes want to reset the index back to consecutive row numbers. We can do this with the `reset_index()` method; passing `drop=True` discards the old index instead of keeping it as a column.

In [319]:
df = df.reset_index(drop=True)
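
The effect of the `drop` argument is easiest to see on a small frame with gaps in its index. This is a sketch with made-up data:

```python
import pandas as pd

# Index with gaps, as left behind after rows have been dropped
toy = pd.DataFrame({'Company': ['A', 'C', 'F']}, index=[0, 2, 5])

dropped = toy.reset_index(drop=True)  # old labels are discarded
kept = toy.reset_index()              # old labels move into an 'index' column

print(dropped.index.tolist())  # [0, 1, 2]
print(kept.columns.tolist())   # ['index', 'Company']
```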

## **3. Data Sorting**

Some of the methods we will use for sorting are listed below.

| Method | Description | 
| --- | --- | 
| `sort_values()` | Sort by the values along either axis. | 
| `sort_index()` | Sort object by labels (along an axis). |
| `nlargest()` | Return the first n rows ordered by columns in descending order. |
| `nsmallest()` | Return the first n rows ordered by columns in ascending order. |

We can use the `sort_values()` method to sort by one or more columns:

In [323]:
# sort by Years_To_Unicorn
# ascending=True --> smallest to largest, A - Z
# ascending=False --> largest to smallest, Z - A
df = df.sort_values('Years_To_Unicorn', ascending=True)
df

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class,Number Investors
159,Ola Electric Mobility,$5B,2019-07-02,Auto & transportation,Bengaluru,India,Asia,2019,"SoftBank Group, Tiger Global Management, Matri...",2019,0,5,High,3
536,Avant,$2B,2012-12-17,Artificial intelligence,Chicago,United States,North America,2012,"RRE Ventures, Tiger Global, August Capital",2012,0,2,Low,3
389,candy.com,$2B,2021-10-21,Fintech,New York,United States,North America,2021,"Insight Partners, Softbank Group, Connect Vent...",2021,0,2,Low,3
309,Flink Food,$3B,2021-12-01,E-commerce & direct-to-consumer,Berlin,Germany,Europe,2021,"Mubadala Capital, Bond, Prosus Ventures",2021,0,3,High,3
544,ClickHouse,$2B,2021-10-28,Data management & analytics,Portola Valley,United States,North America,2021,"Lightspeed Venture Partners, Almaz Capital Par...",2021,0,2,Low,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11,Epic Games,$32B,2018-10-26,Other,Cary,United States,North America,1991,"Tencent Holdings, KKR, Smash Ventures",2018,27,32,High,3
1025,Thirty Madison,$1B,2021-06-02,Health,New York,United States,North America,1993,"Northzone Ventures, Maveron, Johnson & Johnson...",2021,28,1,Low,3
367,Promasidor Holdings,$2B,2016-11-08,Consumer & retail,Bryanston,South Africa,Asia,1979,"IFC, Ajinomoto",2016,37,2,Low,2
689,Five Star Business Finance,$1B,2021-03-26,Other,Chennai,India,Asia,1984,"Sequoia Capital India, Tiger Global Management...",2021,37,1,Low,3


In [324]:
df = df.sort_values(['Years_To_Unicorn', 'Valuation Number (in B)'], ascending=[True, True])
df

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class,Number Investors
765,Jokr,$1B,2021-12-02,E-commerce & direct-to-consumer,New York,United States,North America,2021,"GGV Capital, Tiger Global Management, Greycroft",2021,0,1,Low,3
811,GlobalBees,$1B,2021-12-28,E-commerce & direct-to-consumer,New Delhi,India,Asia,2021,"Chiratae Ventures, SoftBank Group, Trifecta Ca...",2021,0,1,Low,3
952,Mensa Brands,$1B,2021-11-16,Other,Bengaluru,India,Asia,2021,"Accel, Falcon Edge Capital, Norwest Venture Pa...",2021,0,1,Low,3
983,Playco,$1B,2020-09-21,Other,Tokyo,Japan,Asia,2020,"Sozo Ventures, Caffeinated Capital, Sequoia Ca...",2020,0,1,Low,3
536,Avant,$2B,2012-12-17,Artificial intelligence,Chicago,United States,North America,2012,"RRE Ventures, Tiger Global, August Capital",2012,0,2,Low,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11,Epic Games,$32B,2018-10-26,Other,Cary,United States,North America,1991,"Tencent Holdings, KKR, Smash Ventures",2018,27,32,High,3
1025,Thirty Madison,$1B,2021-06-02,Health,New York,United States,North America,1993,"Northzone Ventures, Maveron, Johnson & Johnson...",2021,28,1,Low,3
689,Five Star Business Finance,$1B,2021-03-26,Other,Chennai,India,Asia,1984,"Sequoia Capital India, Tiger Global Management...",2021,37,1,Low,3
367,Promasidor Holdings,$2B,2016-11-08,Consumer & retail,Bryanston,South Africa,Asia,1979,"IFC, Ajinomoto",2016,37,2,Low,2
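
The `ascending` list does not have to use the same direction for every column. The sketch below, on made-up data, sorts by `Years_To_Unicorn` from smallest to largest and breaks ties by `Valuation Number (in B)` from largest to smallest:

```python
import pandas as pd

toy = pd.DataFrame({
    'Company': ['A', 'B', 'C', 'D'],
    'Years_To_Unicorn': [0, 0, 5, 5],
    'Valuation Number (in B)': [1, 2, 3, 4],
})

# One direction per sort key, in the same order as the column list
ordered = toy.sort_values(
    ['Years_To_Unicorn', 'Valuation Number (in B)'],
    ascending=[True, False],
)
print(ordered['Company'].tolist())  # ['B', 'A', 'D', 'C']
```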


To select the rows with the largest or smallest values, use `nlargest()` and `nsmallest()` instead. Looking at the longest times taken to become a unicorn:

In [326]:
# show the 5 rows with the largest Years_To_Unicorn
df.nlargest(5, 'Years_To_Unicorn')

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class,Number Investors
186,Otto Bock HealthCare,$4B,2017-06-24,Health,Duderstadt,Germany,Europe,1919,EQT Partners,2017,98,4,High,1
689,Five Star Business Finance,$1B,2021-03-26,Other,Chennai,India,Asia,1984,"Sequoia Capital India, Tiger Global Management...",2021,37,1,Low,3
367,Promasidor Holdings,$2B,2016-11-08,Consumer & retail,Bryanston,South Africa,Asia,1979,"IFC, Ajinomoto",2016,37,2,Low,2
1025,Thirty Madison,$1B,2021-06-02,Health,New York,United States,North America,1993,"Northzone Ventures, Maveron, Johnson & Johnson...",2021,28,1,Low,3
829,Radius Payment Solutions,$1B,2017-11-27,Fintech,Crewe,United Kingdom,Europe,1990,Inflexion Private Equity,2017,27,1,Low,1
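
When the values in the chosen column are distinct, `nlargest(n, column)` returns the same rows as sorting in descending order and taking the first `n`. This is a small sketch on made-up data that also shows `nsmallest()`:

```python
import pandas as pd

toy = pd.DataFrame({
    'Company': ['A', 'B', 'C', 'D'],
    'Years_To_Unicorn': [4, 98, 10, 37],
})

top_two = toy.nlargest(2, 'Years_To_Unicorn')
same_by_sorting = toy.sort_values('Years_To_Unicorn', ascending=False).head(2)
bottom_two = toy.nsmallest(2, 'Years_To_Unicorn')

print(top_two['Company'].tolist())     # ['B', 'D']
print(bottom_two['Company'].tolist())  # ['A', 'C']
```

With tied values the two approaches may order the tied rows differently, so the equivalence is only claimed here for distinct values.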


Because the rows are now out of their original order, let's sort them again by index:

In [329]:
df.sort_index(axis=0, inplace=True)
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class,Number Investors
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S...",2017,5,180,High,4
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen...",2012,10,100,High,3
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China...",2018,10,100,High,3
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG",2014,4,95,High,3
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita...",2011,6,46,High,3


We can also sort the columns alphabetically:

In [330]:
df.sort_index(axis=1, inplace=True)
df.head()

Unnamed: 0,City,Company,Continent,Country,Date Joined,Industry,Number Investors,Select Investors,Valuation,Valuation Class,Valuation Number (in B),Year Founded,Year Joined,Years_To_Unicorn
0,Beijing,Bytedance,Asia,China,2017-04-07,Artificial intelligence,4,"Sequoia Capital China, SIG Asia Investments, S...",$180B,High,180,2012,2017,5
1,Hawthorne,SpaceX,North America,United States,2012-12-01,Other,3,"Founders Fund, Draper Fisher Jurvetson, Rothen...",$100B,High,100,2002,2012,10
2,Shenzhen,SHEIN,Asia,China,2018-07-03,E-commerce & direct-to-consumer,3,"Tiger Global Management, Sequoia Capital China...",$100B,High,100,2008,2018,10
3,San Francisco,Stripe,North America,United States,2014-01-23,Fintech,3,"Khosla Ventures, LowercaseCapital, capitalG",$95B,High,95,2010,2014,4
4,Stockholm,Klarna,Europe,Sweden,2011-12-12,Fintech,3,"Institutional Venture Partners, Sequoia Capita...",$46B,High,46,2005,2011,6
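
The two `sort_index()` calls above modify `df` in place. As a minimal sketch on made-up data, the same method can instead return a new sorted frame and leave the original untouched, which is the default when `inplace` is not set:

```python
import pandas as pd

toy = pd.DataFrame(
    {'Valuation': ['$2B', '$5B'], 'Company': ['A', 'B'], 'City': ['X', 'Y']},
    index=[536, 159],
)

by_row_label = toy.sort_index(axis=0)    # rows ordered by index label
by_column_name = toy.sort_index(axis=1)  # columns ordered alphabetically

print(by_row_label.index.tolist())      # [159, 536]
print(by_column_name.columns.tolist())  # ['City', 'Company', 'Valuation']
print(toy.index.tolist())               # [536, 159], original is unchanged
```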
