<a href="https://colab.research.google.com/github/harishmuh/Python-for-Data-Science-Analysis/blob/main/Introduction_to_pandas_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Pandas II**

Pandas contains data structures and data manipulation tools designed for fast and easy data cleaning and analysis in Python. Pandas is often used in conjunction with numerical computing libraries such as NumPy and SciPy, analytical libraries such as statsmodels and scikit-learn, and data visualization libraries such as matplotlib. Pandas adopts significant portions of NumPy's idiomatic array-based computing style, particularly its array-based nature and preference for non-looping data processing.

Since becoming open source in 2010, pandas has grown into a sizeable library that can be applied to a wide range of real-world use cases. The developer community has grown to over 800 contributors, who have helped build the project as they have used it to solve everyday data problems.

___

## **1. Inspecting a DataFrame Object**

We will be working with the file `unicorn_companies_raw.csv` from [here](https://github.com/harishmuh/Python-for-Data-Science-Analysis/blob/main/datasets/unicorn_companies_raw.csv), so we need to import the library and read it.

In [1]:
# Importing library
import pandas as pd
import numpy as np
import datetime as dt

# Loading dataset
url = 'https://raw.githubusercontent.com/harishmuh/Python-for-Data-Science-Analysis/refs/heads/main/datasets/unicorn_companies_raw.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,$8B,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,$7B,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,$2B,"Tiger Global Management, Sequoia Capital China..."
3,Stripe,$95B,2014-01-23,FinTech,San Francisco,United States,North America,2010,$2B,"Khosla Ventures, LowercaseCapital, capitalG"
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,$4B,"Institutional Venture Partners, Sequoia Capita..."


We can modify the display options to see more columns later:

In [2]:
# Setting: Displaying 30 columns
pd.options.display.max_columns=30
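
If the wider display is only needed for a single cell, a context manager may be tidier than changing the global setting. The following is a minimal sketch using `pd.option_context`; the value 40 is arbitrary and chosen only for illustration.

```python
import pandas as pd

# Remember the current global setting so we can compare afterwards
default_max = pd.get_option("display.max_columns")

# Inside the block the option is temporarily set to 40
with pd.option_context("display.max_columns", 40):
    inside_max = pd.get_option("display.max_columns")

# On exit the previous value is restored automatically
after_max = pd.get_option("display.max_columns")
```

Anything displayed inside the `with` block uses the temporary setting, while the rest of the notebook keeps the global one.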

Several methods commonly used to check a dataframe include the following.

| Method | Description |
| --- | --- |
| `head()` | Return the first n rows. |
| `tail()` | Return the last n rows. |
| `sample()` | Return a random sample of items from an axis of the object. |
| `info()` | Print a concise summary of a DataFrame. |
| `isna()` | Detect missing values. |
| `duplicated()` | Return a boolean Series denoting duplicate rows. |
| `any()` | Return whether any element is True, potentially over an axis. |
| `all()` | Return whether all elements are True, potentially over an axis. |
| `describe()` | Generate descriptive statistics. |

### `Examining dataframes`

Is the dataframe empty?

In [3]:
df.empty

False

What are the dimensions?

In [4]:
# shows there are 1074 rows and 10 columns
df.shape

(1074, 10)

What columns do we have?

In [5]:
df.columns

Index(['Company', 'Valuation', 'Date Joined', 'Industry', 'City',
       'Country/Region', 'Continent', 'Year Founded', 'Funding',
       'Select Investors'],
      dtype='object')

What about the index?

In [6]:
df.index

RangeIndex(start=0, stop=1074, step=1)

What does the data look like?

Look at the top 5 lines with `head()`:

In [7]:
# by default: displays the top 5 rows
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,$8B,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,$7B,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,$2B,"Tiger Global Management, Sequoia Capital China..."
3,Stripe,$95B,2014-01-23,FinTech,San Francisco,United States,North America,2010,$2B,"Khosla Ventures, LowercaseCapital, capitalG"
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,$4B,"Institutional Venture Partners, Sequoia Capita..."


In [8]:
# display top 10 lines
df.head(10)

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,$8B,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,$7B,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,$2B,"Tiger Global Management, Sequoia Capital China..."
3,Stripe,$95B,2014-01-23,FinTech,San Francisco,United States,North America,2010,$2B,"Khosla Ventures, LowercaseCapital, capitalG"
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,$4B,"Institutional Venture Partners, Sequoia Capita..."
5,Canva,$40B,2018-01-08,Internet software & services,Surry Hills,Australia,Oceania,2012,$572M,"Sequoia Capital China, Blackbird Ventures, Mat..."
6,Checkout.com,$40B,2019-05-02,Fintech,London,United Kingdom,Europe,2012,$2B,"Tiger Global Management, Insight Partners, DST..."
7,Instacart,$39B,2014-12-30,"Supply chain, logistics, & delivery",San Francisco,United States,North America,2012,$3B,"Khosla Ventures, Kleiner Perkins Caufield & By..."
8,JUUL Labs,$38B,2017-12-20,Consumer & retail,San Francisco,United States,North America,2015,$14B,Tiger Global Management
9,Databricks,$38B,2019-02-05,Data management and analytics,San Francisco,United States,North America,2013,$3B,"Andreessen Horowitz, New Enterprise Associates..."


Look at the bottom 5 rows with `tail()`:

In [9]:
# by default: displays the bottom 5 rows
df.tail()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
1069,Zhaogang,$1B,2017-06-29,E-commerce & direct-to-consumer,Shanghai,China,Asia,2012,$379M,"K2 Ventures, Matrix Partners China, IDG Capital"
1070,Zhuan Zhuan,$1B,2017-04-18,E-commerce & direct-to-consumer,Beijing,China,Asia,2015,$990M,"58.com, Tencent Holdings"
1071,Zihaiguo,$1B,2021-05-06,Consumer & retail,Chongqing,China,Asia,2018,$80M,"Xingwang Investment Management, China Capital ..."
1072,Zopa,$1B,2021-10-19,Fintech,London,United Kingdom,Europe,2005,$792M,"IAG Capital Partners, Augmentum Fintech, North..."
1073,Zwift,$1B,2020-09-16,E-commerce & direct-to-consumer,Long Beach,United States,North America,2014,$620M,"Novator Partners, True, Causeway Media Partners"


In [10]:
# display bottom 3 lines
df.tail(3)

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
1071,Zihaiguo,$1B,2021-05-06,Consumer & retail,Chongqing,China,Asia,2018,$80M,"Xingwang Investment Management, China Capital ..."
1072,Zopa,$1B,2021-10-19,Fintech,London,United Kingdom,Europe,2005,$792M,"IAG Capital Partners, Augmentum Fintech, North..."
1073,Zwift,$1B,2020-09-16,E-commerce & direct-to-consumer,Long Beach,United States,North America,2014,$620M,"Novator Partners, True, Causeway Media Partners"


View random rows with `sample()`:

In [11]:
# by default: displays 1 row randomly
df.sample()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
137,Lendable,$5B,2021-03-31,Fintech,London,United Kingdom,Europe,2014,$286M,"Ontario Teachers' Pension Plan, Goldman Sachs"


In [12]:
# display 10 random rows
# random_state fixes the random seed, so the same rows are drawn on every run
df.sample(10, random_state=5)

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
917,FlashEx,$1B,2018-08-27,"Supply chain, logistics, & delivery",Beijing,China,Asia,2014,$359M,"Prometheus Capital, Matrix Partners China, JD ..."
846,Jiuxian,$1B,2015-07-30,E-commerce & direct-to-consumer,Beijing,China,Asia,2009,$250M,"Sequoia Capital China, Rich Land Capital, Merr..."
995,Pentera,$1B,2022-01-11,Cybersecurity,Petah Tikva,Israel,Asia,2015,$190M,"AWZ Ventures, Blackstone, Insight Partners"
562,Formlabs,$2B,2018-08-01,Hardware,Somerville,United States,North America,2011,$251M,"Pitango Venture Capital, DFJ Growth Fund, Foun..."
477,PAX,$2B,2018-10-22,Consumer & retail,San Francisco,United States,North America,2007,$542M,"Tao Capital Partners, Global Asset Capital, Ti..."
895,CommerceIQ,$1B,2022-03-21,Artificial Intelligence,Palo Alto,United States,North America,2012,$196M,"Trinity Ventures, Madrona Venture Group, Shast..."
60,Dunamu,$9B,2021-07-22,Fintech,Seoul,South Korea,Asia,2012,$71M,"Qualcomm Ventures, Woori Investment, Hanwha In..."
682,Gymshark,$1B,2020-08-14,E-commerce & direct-to-consumer,Solihull,United Kingdom,Europe,2012,$262M,General Atlantic
352,wefox,$3B,2019-03-05,Fintech,Berlin,Germany,Europe,2014,$919M,"Salesforce Ventures, Seedcamp, OMERS Ventures"
458,Trader Interactive,$2B,2021-05-12,Other,Norfolk,United States,North America,2017,$624M,Carsales
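
The effect of `random_state` can be checked directly. The sketch below uses a small made-up frame (`toy` is hypothetical and stands in for the unicorn dataset) and draws the same sample twice.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({"Company": list("ABCDEFGH"), "Valuation": list(range(8))})

# Two draws with the same seed
first_draw = toy.sample(3, random_state=5)
second_draw = toy.sample(3, random_state=5)

# True when both draws contain the same rows in the same order
same_rows = first_draw.equals(second_draw)
```

Without `random_state`, the two draws would generally differ from run to run.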


What about the data types? Are there any null values?

In [14]:
# display data types
df.dtypes.to_frame('Data Type')

Unnamed: 0,Data Type
Company,object
Valuation,object
Date Joined,object
Industry,object
City,object
Country/Region,object
Continent,object
Year Founded,int64
Funding,object
Select Investors,object


In [15]:
# get summary of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Company           1074 non-null   object
 1   Valuation         1074 non-null   object
 2   Date Joined       1074 non-null   object
 3   Industry          1074 non-null   object
 4   City              1057 non-null   object
 5   Country/Region    1074 non-null   object
 6   Continent         1074 non-null   object
 7   Year Founded      1074 non-null   int64 
 8   Funding           1074 non-null   object
 9   Select Investors  1074 non-null   object
dtypes: int64(1), object(9)
memory usage: 84.0+ KB


In [17]:
# check the number of missing values
df.isna().sum().to_frame('Missing Values')

Unnamed: 0,Missing Values
Company,0
Valuation,0
Date Joined,0
Industry,0
City,17
Country/Region,0
Continent,0
Year Founded,0
Funding,0
Select Investors,0


Next we will display the rows with missing values.

In [18]:
# any(): True if at least one value is True
# all(): True only if every value is True
# both default to axis=0 (one result per column); axis=1 gives one result per row
# flag the rows that contain at least one missing value
condition = df.isna().any(axis=1)
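
To make the difference between `any()` and `all()` and the role of `axis` concrete, here is a minimal sketch on a three-row frame. The data in `toy` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing entries
toy = pd.DataFrame({
    "City": ["London", np.nan, np.nan],
    "Country": ["United Kingdom", "Singapore", np.nan],
})

mask = toy.isna()

# axis=1: one result per row
rows_with_any_missing = mask.any(axis=1)  # at least one column is NaN
rows_all_missing = mask.all(axis=1)       # every column is NaN

# default axis=0: one result per column
cols_with_any_missing = mask.any()
```

Only the last row is flagged by `all(axis=1)`, while the last two rows are flagged by `any(axis=1)`.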

In [19]:
# display rows that have at least 1 missing value
df_missing_rows = df[condition]
df_missing_rows

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
12,FTX,$32B,2021-07-20,Fintech,,Bahamas,North America,2018,$2B,"Sequoia Capital, Thoma Bravo, Softbank"
169,HyalRoute,$4B,2020-05-26,Mobile & telecommunications,,Singapore,Asia,2015,$263M,Kuang-Chi
241,Moglix,$3B,2021-05-17,E-commerce & direct-to-consumer,,Singapore,Asia,2015,$471M,"Jungle Ventures, Accel, Venture Highway"
250,Trax,$3B,2019-07-22,Artificial intelligence,,Singapore,Asia,2010,$1B,"Hopu Investment Management, Boyu Capital, DC T..."
324,Amber Group,$3B,2021-06-21,Fintech,,Hong Kong,Asia,2015,$328M,"Tiger Global Management, Tiger Brokers, DCM Ve..."
381,Ninja Van,$2B,2021-09-27,"Supply chain, logistics, & delivery",,Singapore,Asia,2014,$975M,"B Capital Group, Monk's Hill Ventures, Dynamic..."
511,ZocDoc,$2B,2015-08-20,Health,,United States,North America,2007,$374M,Founders Fund
540,Advance Intelligence Group,$2B,2021-09-23,Artificial intelligence,,Singapore,Asia,2016,$536M,"Vision Plus Capital, GSR Ventures, ZhenFund"
810,Carousell,$1B,2021-09-15,E-commerce & direct-to-consumer,,Singapore,Asia,2012,$288M,"500 Global, Rakuten Ventures, Golden Gate Vent..."
847,Matrixport,$1B,2021-06-01,Fintech,,Singapore,Asia,2019,$100M,"Dragonfly Captial, Qiming Venture Partners, DS..."


Was duplicate data found?

In [22]:
# count the number of duplicates
df.duplicated().sum()

np.int64(0)

The check above finds no rows that are identical across every column. However, we will look more closely for duplicates based on the `Company` column alone.

In [23]:
df[df.duplicated(subset=['Company'], keep=False)]

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Funding,Select Investors
385,BrewDog,$2B,2017-04-10,Consumer & retail,Aberdeen,United Kingdom,Europe,2007,$233M,"TSG Consumer Partners, Crowdcube"
386,BrewDog,$2B,2017-04-10,Consumer & retail,Aberdeen,UnitedKingdom,Europe,2007,$233M,TSG Consumer Partners
510,ZocDoc,$2B,2015-08-20,Health,New York,United States,North America,2007,$374M,"Founders Fund, Khosla Ventures, Goldman Sachs"
511,ZocDoc,$2B,2015-08-20,Health,,United States,North America,2007,$374M,Founders Fund
1031,SoundHound,$1B,2018-05-03,Artificial intelligence,Santa Clara,United States,North America,2005,$215M,"Tencent Holdings, Walden Venture Capital, Glob..."
1032,SoundHound,$1B,2018-05-03,Other,Santa Clara,United States,North America,2005,$215M,Tencent Holdings
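
Partial duplicates like these could be removed with `drop_duplicates()` and the same `subset` argument. The sketch below runs on made-up rows (`toy` is hypothetical); which occurrence to keep is a judgment call that depends on which row holds the better data, so `keep='first'` here is only an assumption.

```python
import pandas as pd

# Hypothetical rows mimicking a company that appears twice
toy = pd.DataFrame({
    "Company": ["BrewDog", "BrewDog", "Zopa"],
    "Country/Region": ["United Kingdom", "UnitedKingdom", "United Kingdom"],
})

# keep='first' retains the first occurrence of each Company and drops the rest
deduped = (
    toy.drop_duplicates(subset=["Company"], keep="first")
       .reset_index(drop=True)
)
```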


### `Describing and Summarizing`

Get summary statistics from numeric columns

In [24]:
df.describe()

# count: number of non-NaN values
# mean: average
# std: standard deviation
# min: minimum value
# 25%: quartile 1
# 50%: quartile 2 or median
# 75%: quartile 3
# max: maximum value

Unnamed: 0,Year Founded
count,1074.0
mean,2012.870577
std,5.705494
min,1919.0
25%,2011.0
50%,2014.0
75%,2016.0
max,2021.0


Statistical summary of categorical columns

In [25]:
df.describe(include='object')

# count: number of non-NaN values
# unique: number of unique values
# top: mode (the value that appears most often)
# freq: frequency of occurrence of the mode

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Funding,Select Investors
count,1074,1074,1074,1074,1057,1074,1074,1074,1074
unique,1071,30,638,18,256,47,6,538,1059
top,BrewDog,$1B,2021-07-13,Fintech,San Francisco,United States,North America,$1B,Sequoia Capital
freq,2,472,9,204,149,561,588,59,3


There are also methods for certain statistics. The following are some examples.

| Method | Description | Data types |
| --- | --- | --- |
| `count()` | The number of non-null observations | Any |
| `value_counts()` | The count of each unique value | Any |
| `unique()` | The unique values | Any |
| `nunique()` | The number of unique values | Any |
| `sum()` | The total of the values | Numerical or Boolean |
| `mean()` | The average of the values | Numerical or Boolean |
| `median()` | The median of the values | Numerical |
| `min()` | The minimum of the values | Numerical |
| `idxmin()` | The index where the minimum value occurs | Numerical |
| `max()` | The maximum of the values | Numerical |
| `idxmax()` | The index where the maximum value occurs | Numerical |
| `abs()` | The absolute values of the data | Numerical |
| `std()` | The standard deviation | Numerical |
| `var()` | The variance | Numerical |
| `cov()` | The covariance between two `Series`, or a covariance matrix for all column combinations in a `DataFrame` | Numerical |
| `corr()` | The correlation between two `Series`, or a correlation matrix for all column combinations in a `DataFrame` | Numerical |
| `quantile()` | Calculate a specific quantile | Numerical |
| `cumsum()` | The cumulative sum | Numerical or Boolean |
| `cummin()` | The cumulative minimum | Numerical |
| `cummax()` | The cumulative maximum | Numerical |

For example, what are the unique values in the `Continent` column? How many are there?

In [26]:
# find out unique value --> unique()
df['Continent'].unique()

array(['Asia', 'North America', 'Europe', 'Oceania', 'South America',
       'Africa'], dtype=object)

In [27]:
# find out the number of unique values
len(df['Continent'].unique())

6

In [28]:
# find out the number of unique values
df['Continent'].nunique()

6

What is the median `Year Founded` of companies located in `Continent` Europe?

In [29]:
df[df['Continent']=='Europe']['Year Founded'].median()

2013.5
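
When the same statistic is wanted for every continent at once, `groupby()` may be more convenient than filtering one group at a time. The sketch below uses made-up numbers (`toy` is hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Continent": ["Europe", "Europe", "Asia", "Asia", "Asia"],
    "Year Founded": [2005, 2012, 2012, 2008, 2015],
})

# One median per continent
median_by_continent = toy.groupby("Continent")["Year Founded"].median()

# Same result for a single group, using boolean filtering with .loc
europe_median = toy.loc[toy["Continent"] == "Europe", "Year Founded"].median()
```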

What industries are covered in the dataset?

In [30]:
df['Industry'].unique()

array(['Artificial intelligence', 'Other',
       'E-commerce & direct-to-consumer', 'FinTech', 'Fintech',
       'Internet software & services',
       'Supply chain, logistics, & delivery', 'Consumer & retail',
       'Data management and analytics', 'Edtech', 'Health', 'Hardware',
       'Auto & transportation', 'Travel', 'Cybersecurity',
       'Mobile & telecommunications', 'Data management & analytics',
       'Artificial Intelligence'], dtype=object)

How many companies are there in each industry?

In [31]:
df['Industry'].value_counts()

Unnamed: 0_level_0,count
Industry,Unnamed: 1_level_1
Fintech,204
Internet software & services,204
E-commerce & direct-to-consumer,111
Health,75
Artificial intelligence,73
Other,59
"Supply chain, logistics, & delivery",57
Cybersecurity,50
Mobile & telecommunications,38
Data management & analytics,35


In [32]:
# normalize=True: return proportions that sum to 1; multiplying by 100 gives percentages
# round(2): round to two decimal places
(df['Industry'].value_counts(normalize=True)*100).round(2)

Unnamed: 0_level_0,proportion
Industry,Unnamed: 1_level_1
Fintech,18.99
Internet software & services,18.99
E-commerce & direct-to-consumer,10.34
Health,6.98
Artificial intelligence,6.8
Other,5.49
"Supply chain, logistics, & delivery",5.31
Cybersecurity,4.66
Mobile & telecommunications,3.54
Data management & analytics,3.26


How many `Country/Region` are there in the dataset?

In [33]:
df['Country/Region'].nunique()

47

In what year was the oldest company that became a unicorn founded?

In [34]:
df['Year Founded'].min()

1919

## **2. Data Cleaning**

Some of the methods that we will use to clean a dataset include the following.

| Method | Description |
| --- | --- |
| `replace()` | Replace values. |
| `drop()` | Drop specified labels from rows or columns. |
| `rename()` | Rename columns or index labels. |
| `assign()` | Assign new columns to a DataFrame. |
| `apply()` | Apply a function along an axis of the DataFrame. |
| `dropna()` | Remove missing values. |
| `fillna()` | Fill NA/NaN values using the specified method. |
| `drop_duplicates()` | Return DataFrame with duplicate rows removed. |
| `reset_index()` | Reset the index, or a level of it. |

### `Type correction`

Does any data type look wrong? `Date Joined` holds dates but is stored as `object` (strings); it should be a datetime. Let's fix this.

In [35]:
# pd.to_datetime: changes data type to datetime
df['Date Joined'] = pd.to_datetime(df['Date Joined'])
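
The conversion above works because every entry in `Date Joined` is a valid date. For messier columns, one option is `errors='coerce'`, which turns unparseable entries into `NaT` instead of raising an error. A minimal sketch on made-up strings (`raw_dates` is hypothetical):

```python
import pandas as pd

# Hypothetical raw column with one invalid entry
raw_dates = pd.Series(["2017-04-07", "2012-12-01", "not a date"])

# errors='coerce' converts entries that do not match the format into NaT
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Count how many entries could not be parsed
n_unparsed = parsed.isna().sum()
```

Checking `n_unparsed` after the conversion shows how many values were lost, which is safer than coercing silently.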

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Company           1074 non-null   object        
 1   Valuation         1074 non-null   object        
 2   Date Joined       1074 non-null   datetime64[ns]
 3   Industry          1074 non-null   object        
 4   City              1057 non-null   object        
 5   Country/Region    1074 non-null   object        
 6   Continent         1074 non-null   object        
 7   Year Founded      1074 non-null   int64         
 8   Funding           1074 non-null   object        
 9   Select Investors  1074 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 84.0+ KB


### `Typo Correction`

Some values in the `Industry` column contain typos. The correct industry names are in the following list.

In [38]:
industry_list = ['Artificial intelligence', 'Other','E-commerce & direct-to-consumer', 'Fintech',
  'Internet software & services', 'Supply chain, logistics, & delivery', 'Consumer & retail',
'Data management & analytics', 'Edtech', 'Health', 'Hardware','Auto & transportation',
'Travel', 'Cybersecurity','Mobile & telecommunications']

We will display the values in the `Industry` column that are not in `industry_list`.

In [39]:
for industry in df['Industry'].unique():
    if industry not in industry_list:
        print(industry)

FinTech
Data management and analytics
Artificial Intelligence


In [40]:
set(df['Industry'].unique()).difference(set(industry_list))

{'Artificial Intelligence', 'Data management and analytics', 'FinTech'}
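
A third way to run the same check is `isin()`, which stays vectorized and also makes it easy to count how often each unexpected value occurs. The sketch below uses a made-up series and reference list (`industries` and `allowed` are hypothetical).

```python
import pandas as pd

# Hypothetical reference list and column
allowed = ["Fintech", "Health"]
industries = pd.Series(["Fintech", "FinTech", "Health", "health"], name="Industry")

# ~isin(...) flags the values that are absent from the reference list
unexpected = industries[~industries.isin(allowed)]

# How many times each unexpected value appears
unexpected_counts = unexpected.value_counts()
```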

We will try to fix the incorrect data in the `Industry` column.

In [41]:
# create a dictionary containing {'old_value' : 'new_value'}
replacement_dict = {
'Artificial Intelligence' : 'Artificial intelligence',
'Data management and analytics': 'Data management & analytics',
'FinTech': 'Fintech'
}

df['Industry'] = df['Industry'].replace(replacement_dict)

In [42]:
# We can also replace values one at a time like this
# (the result is only displayed here, not assigned back to df)
df['Industry'].replace('FinTech', 'Fintech')

Unnamed: 0,Industry
0,Artificial intelligence
1,Other
2,E-commerce & direct-to-consumer
3,Fintech
4,Fintech
...,...
1069,E-commerce & direct-to-consumer
1070,E-commerce & direct-to-consumer
1071,Consumer & retail
1072,Fintech


The number of incorrect values remaining in the `Industry` column after the changes is as follows:

In [43]:
len(set(df['Industry'].unique()).difference(set(industry_list)))

0

We will change 'UnitedKingdom' to 'United Kingdom' in the `Country/Region` column.

In [44]:
df['Country/Region'] = df['Country/Region'].replace('UnitedKingdom', 'United Kingdom')

### `Drop unnecessary features`

Let’s start by removing the column we won’t be using in this analysis, the `Funding` column.

In [45]:
# drop() returns a new DataFrame by default;
# to keep the change, reassign the result or use inplace=True
df.drop(columns=['Funding'], inplace=True)

In [46]:
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country/Region,Continent,Year Founded,Select Investors
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S..."
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen..."
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China..."
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG"
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita..."


### `Fixing feature name`

Next, rename the `Country/Region` column to `Country`:

In [47]:
df.rename(columns={'Country/Region':'Country'}, inplace=True)

In [48]:
df.columns

Index(['Company', 'Valuation', 'Date Joined', 'Industry', 'City', 'Country',
       'Continent', 'Year Founded', 'Select Investors'],
      dtype='object')

### `Add new feature`

Let's create the following new columns:

1. Year Joined
2. Years_To_Unicorn
3. Valuation Number (in B)
4. Valuation Class
5. Number Investors

**Year Joined**

The `Year Joined` column contains the year information from the `Date Joined` column.

In [49]:
df['Year Joined'] = df['Date Joined'].dt.year
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S...",2017
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen...",2012
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China...",2018
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG",2014
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita...",2011


**Years_To_Unicorn**

The `Years_To_Unicorn` column is the number of years from the company's founding until it became a unicorn.

In [50]:
df['Years_To_Unicorn'] = df['Year Joined'] - df['Year Founded']
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S...",2017,5
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen...",2012,10
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China...",2018,10
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG",2014,4
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita...",2011,6


In [51]:
# another alternative to adding a column is to use assign
# syntax
# df.assign(
# new_column1=function1,
# new_column2=function2
# )

df = df.assign(
Years_To_Unicorn=lambda x: x['Year Joined'] - x['Year Founded']
)
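
One convenience of `assign()` is that later keyword arguments can refer to columns created earlier in the same call. A minimal sketch on a made-up two-row frame (`toy` is hypothetical); the `**{...}` unpacking is only needed because `Year Joined` contains a space and so cannot be written as a plain keyword.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "Date Joined": pd.to_datetime(["2017-04-07", "2012-12-01"]),
    "Year Founded": [2012, 2002],
})

# 'Year Joined' is created first, then used to compute 'Years_To_Unicorn'
result = toy.assign(
    **{"Year Joined": lambda x: x["Date Joined"].dt.year},
    Years_To_Unicorn=lambda x: x["Year Joined"] - x["Year Founded"],
)
```

Because `assign()` returns a new DataFrame, `toy` itself is left unchanged.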

We display statistical information from the `Years_To_Unicorn` column.

In [52]:
df['Years_To_Unicorn'].describe()

Unnamed: 0,Years_To_Unicorn
count,1074.0
mean,7.013035
std,5.331842
min,-3.0
25%,4.0
50%,6.0
75%,9.0
max,98.0


The minimum of -3 shows that there are rows with negative `Years_To_Unicorn` values.

In [53]:
df[df['Years_To_Unicorn']<0]

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn
527,InVision,$2B,2017-11-01,Internet software & services,New York,United States,North America,2020,"FirstMark Capital, Tiger Global Management, IC...",2017,-3


Based on the results of an Internet search, it is known that the company was founded in 2011. We change the 'Year Founded' value of the company to 2011.

In [54]:
df.loc[527, 'Year Founded'] = 2011
df.loc[[527]]

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn
527,InVision,$2B,2017-11-01,Internet software & services,New York,United States,North America,2011,"FirstMark Capital, Tiger Global Management, IC...",2017,-3


We recalculate the `Years_To_Unicorn` column and check that there are no negative values in the column.

In [55]:
df['Years_To_Unicorn'] = df['Year Joined'] - df['Year Founded']
df['Years_To_Unicorn'].describe()

Unnamed: 0,Years_To_Unicorn
count,1074.0
mean,7.021415
std,5.323155
min,0.0
25%,4.0
50%,6.0
75%,9.0
max,98.0


**Valuation Number (in B)**

The `Valuation Number (in B)` column contains the valuation from the `Valuation` column as an integer, in billions of dollars.

In [56]:
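def str_to_num(valuation):
    # strip() removes leading/trailing '$' and 'B' characters, e.g. '$180B' -> '180'
    valuation = valuation.strip('$B')
    # convert the remaining digits to an integer
    valuation = int(valuation)
    return valuation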

In [57]:
# example of apply using regular function
df['Valuation Number (in B)'] = df['Valuation'].apply(str_to_num)
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B)
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S...",2017,5,180
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen...",2012,10,100
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China...",2018,10,100
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG",2014,4,95
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita...",2011,6,46


In [58]:
# example of apply using lambda function
df['Valuation'].apply(lambda x: int(x[1:-1]))

Unnamed: 0,Valuation
0,180
1,100
2,100
3,95
4,46
...,...
1069,1
1070,1
1071,1
1072,1
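
For simple string clean-up like this, the vectorized `.str` accessor is an alternative to `apply()`. A minimal sketch on made-up values (`valuation` is a hypothetical series in the same `$<number>B` format):

```python
import pandas as pd

# Hypothetical values in the same format as the Valuation column
valuation = pd.Series(["$180B", "$100B", "$1B"], name="Valuation")

# Strip the leading '$' and trailing 'B', then cast to integer
valuation_number = valuation.str.strip("$B").astype(int)
```

Like `str_to_num`, this assumes every value ends in `B`; a value such as `$500M` would need extra handling.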


**Valuation Class**

The `Valuation Class` column is `Low` if the company is in the bottom 50% of valuations and `High` if the company is in the top 50%.

In [59]:
med = df['Valuation Number (in B)'].median()
df['Valuation Class'] = df['Valuation Number (in B)'].apply(lambda x: 'High' if x > med else 'Low')
df.head()

# note: this split can be uneven when many companies share the median valuation,
# because every value equal to the median is labeled 'Low'

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S...",2017,5,180,High
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen...",2012,10,100,High
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China...",2018,10,100,High
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG",2014,4,95,High
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita...",2011,6,46,High


In [60]:
# divide into quantile-based groups of roughly equal size
# syntax: pd.qcut(data, q=number_of_groups, labels=one_label_per_group)
df['Valuation Class'] = pd.qcut(df['Valuation Number (in B)'], q=2, labels=['Low', 'High'])
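
When many values are tied at the median, a two-way quantile split cannot be exactly even. The sketch below illustrates this on made-up numbers (`valuations` is hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical valuations with several values tied at the median (2)
valuations = pd.Series([1, 1, 2, 2, 2, 3, 5, 8], name="valuation_b")

# Two quantile-based bins: values up to the median, and values above it
classes = pd.qcut(valuations, q=2, labels=["Low", "High"])

# Count the members of each class
n_low = (classes == "Low").sum()
n_high = (classes == "High").sum()
```

Here the tied values all fall into `Low`, so the classes hold 5 and 3 members instead of 4 and 4. Checking `value_counts()` on the new column is a quick way to see how balanced the split turned out.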

**Number Investors**

The `Number Investors` column contains the number of investors listed in the `Select Investors` column.

In [61]:
df['Number Investors'] = df['Select Investors'].apply(lambda x: len(x.split(', ')))
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class,Number Investors
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S...",2017,5,180,High,4
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen...",2012,10,100,High,3
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China...",2018,10,100,High,3
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG",2014,4,95,High,3
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita...",2011,6,46,High,3
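
The same count can also be obtained without `apply()` by chaining `.str` methods. A minimal sketch on made-up strings (`investors` is hypothetical), assuming, as the lambda above does, that names are separated by a comma followed by a space:

```python
import pandas as pd

# Hypothetical investor strings
investors = pd.Series([
    "Sequoia Capital, Thoma Bravo, Softbank",
    "Kuang-Chi",
    "58.com, Tencent Holdings",
], name="Select Investors")

# Split each string into a list, then take the length of each list
number_investors = investors.str.split(", ").str.len()
```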


### `Handling Missing Values`

Handling missing values is an important part of EDA, and there are several ways to do it. The two main approaches are to delete them or to impute (fill in) other values in their place. The right choice depends on the business problem and on what the chosen approach adds to the analysis.

Here, we will try both.

**Removing missing values**

To compare the effects of different actions, first store the original data count in a variable. Create a variable called `count_total`, an integer representing the total number of values in `df`. For example, if the dataframe has 5 rows and 2 columns, the count is 10.

In [63]:
# size is the number of rows multiplied by the number of columns
count_total = df.size
count_total

15036

Now, delete all rows containing missing values and store the number of remaining values in a variable named `count_dropna_rows`.

In [64]:
# delete the entire row even if the missing value is only in 1 column
count_dropna_rows = df.dropna(axis=0).size
count_dropna_rows

14798

Now, drop all columns containing missing values and store the number of remaining values in a variable named `count_dropna_columns`.

In [65]:
count_dropna_columns = df.dropna(axis=1).size
count_dropna_columns

13962

Next, print the percentage of values removed by each method and compare.

In [66]:
print(f'Percentage of values removed by dropping rows: {(count_total - count_dropna_rows)/count_total * 100:.2f}%')
print(f'Percentage of values removed by dropping columns: {(count_total - count_dropna_columns)/count_total * 100:.2f}%')

Percentage of values removed by dropping rows: 1.58%
Percentage of values removed by dropping columns: 7.14%


Deleting rows is the better option here because it loses less data (1.58%) than deleting columns (7.14%).
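
The comparison can be reproduced on a small scale. The following sketch uses a made-up DataFrame (all values are placeholders) to show how `size` changes under each `dropna` axis, and adds `isna().sum()` as one way to see where the missing values sit before choosing.

```python
import numpy as np
import pandas as pd

# Toy data with a single missing City, for illustration only
toy = pd.DataFrame({
    'Company': ['A', 'B', 'C', 'D'],
    'City': ['Paris', np.nan, 'Tokyo', 'Lima'],
    'Year Founded': [2010, 2012, 2015, 2018],
})

total = toy.size                      # 4 rows x 3 columns = 12 values
after_rows = toy.dropna(axis=0).size  # row 'B' removed -> 3 x 3 = 9 values
after_cols = toy.dropna(axis=1).size  # column 'City' removed -> 4 x 2 = 8 values

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(total, after_rows, after_cols)  # 12 9 8
```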

**Filling missing values**

Now we will practice the second method: imputation. We can fill missing values with the DataFrame method [`fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna). For example, we will fill the missing values in the `City` column with its mode (the most frequent value).

In [67]:
modus_city = df['City'].mode()[0]
modus_city

'San Francisco'

In [68]:
# impute missing cities with the mode, working on a copy so df stays unchanged
df_fillna = df.copy()
df_fillna['City'] = df['City'].fillna(modus_city)

In [69]:
df_fillna.loc[df_missing_rows.index]

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class,Number Investors
12,FTX,$32B,2021-07-20,Fintech,San Francisco,Bahamas,North America,2018,"Sequoia Capital, Thoma Bravo, Softbank",2021,3,32,High,3
169,HyalRoute,$4B,2020-05-26,Mobile & telecommunications,San Francisco,Singapore,Asia,2015,Kuang-Chi,2020,5,4,High,1
241,Moglix,$3B,2021-05-17,E-commerce & direct-to-consumer,San Francisco,Singapore,Asia,2015,"Jungle Ventures, Accel, Venture Highway",2021,6,3,High,3
250,Trax,$3B,2019-07-22,Artificial intelligence,San Francisco,Singapore,Asia,2010,"Hopu Investment Management, Boyu Capital, DC T...",2019,9,3,High,3
324,Amber Group,$3B,2021-06-21,Fintech,San Francisco,Hong Kong,Asia,2015,"Tiger Global Management, Tiger Brokers, DCM Ve...",2021,6,3,High,3
381,Ninja Van,$2B,2021-09-27,"Supply chain, logistics, & delivery",San Francisco,Singapore,Asia,2014,"B Capital Group, Monk's Hill Ventures, Dynamic...",2021,7,2,Low,3
511,ZocDoc,$2B,2015-08-20,Health,San Francisco,United States,North America,2007,Founders Fund,2015,8,2,Low,1
540,Advance Intelligence Group,$2B,2021-09-23,Artificial intelligence,San Francisco,Singapore,Asia,2016,"Vision Plus Capital, GSR Ventures, ZhenFund",2021,5,2,Low,3
810,Carousell,$1B,2021-09-15,E-commerce & direct-to-consumer,San Francisco,Singapore,Asia,2012,"500 Global, Rakuten Ventures, Golden Gate Vent...",2021,9,1,Low,3
847,Matrixport,$1B,2021-06-01,Fintech,San Francisco,Singapore,Asia,2019,"Dragonfly Captial, Qiming Venture Partners, DS...",2021,2,1,Low,3


The imputation results are nonsensical, as there is no city named San Francisco in the Bahamas, Singapore, or Hong Kong.

Another option is to fill in a placeholder value such as 'Unknown'. However, this adds no information to the dataset and can make missing values harder to find later.
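
One possible refinement, which this notebook does not use, is to take the mode within each country rather than across the whole column. The sketch below is only an illustration on made-up data; the helper `fill_with_group_mode` and the `City filled` column are hypothetical names, and whether a per-country mode is a sensible guess still depends on the dataset.

```python
import pandas as pd

# Toy data: one missing city per country, for illustration only
toy = pd.DataFrame({
    'Country': ['United States', 'United States', 'United States', 'Singapore', 'Singapore'],
    'City': ['San Francisco', 'San Francisco', None, 'Singapore', None],
})

def fill_with_group_mode(s):
    """Fill missing entries of one group with that group's most frequent value."""
    modes = s.mode()  # mode() ignores missing values by default
    if modes.empty:   # every value in the group is missing: leave it as is
        return s
    return s.fillna(modes.iloc[0])

# transform() returns a result aligned with the original rows
toy['City filled'] = toy.groupby('Country')['City'].transform(fill_with_group_mode)
print(toy)
```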

For this dataset, we will instead delete the rows that have missing values.

In [70]:
# delete rows that have missing values
df = df.dropna()

### `Handling Duplicates`

Next, we will deal with duplicate data. Every dataset is unique, so we cannot treat them all the same way. When deciding whether to remove duplicate values, we must think carefully about the dataset itself and the goals we are trying to achieve.

1. Deciding to remove duplicates

We should remove duplicate values if they are clearly errors or would misrepresent the remaining unique values in the dataset.

For example, a data professional would in most cases remove duplicate values from a dataset of home addresses and home prices. Counting the same home twice would distort conclusions drawn from the dataset as a whole, such as the average home price, the total of home prices, or even the total number of homes. In such cases, removing the duplicates lets the remaining data be represented fairly during analysis and visualization.

2. Deciding NOT to discard duplicates

We should keep duplicate data in our datasets if the duplicate values are clearly not errors and must be accounted for when representing the dataset as a whole.

For example, a dataset recording the throws and distances an Olympic shot putter achieves in training will likely include some duplicate distances. Given the number of attempts and the fixed weight of the shot, repeated values are expected, especially if the distances are recorded to only 1 or 2 decimal places. In such cases, a data professional would almost certainly keep all of the data to represent it adequately during analysis and visualization.

Here, we will handle duplicate data by deleting it. The shape of the data before deleting duplicates is as follows:

In [71]:
df.shape

(1057, 14)

For each set of duplicates, we will keep the row of the first occurrence and delete the rows of subsequent occurrences.

In [72]:
# drop rows whose Company value repeats an earlier row (the first occurrence is kept by default)
df = df.drop_duplicates(subset='Company')

The shape of the data after deleting duplicates is as follows:

In [73]:
df.shape

(1055, 14)
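
To check which rows a call like `drop_duplicates(subset='Company')` removes, `duplicated()` can be used to inspect them first. This is a minimal sketch on made-up data (the company names are placeholders).

```python
import pandas as pd

# Toy data where 'Acme' appears twice, for illustration only
toy = pd.DataFrame({
    'Company': ['Acme', 'Bolt', 'Acme', 'Cargo'],
    'Valuation': ['$1B', '$2B', '$1B', '$3B'],
})

# keep=False flags every member of a duplicate group so they can be viewed together
dupes = toy[toy.duplicated(subset='Company', keep=False)]
print(dupes)

# keep='first' (the default) retains the first occurrence and drops the rest
deduped = toy.drop_duplicates(subset='Company', keep='first')
print(deduped.shape)  # (3, 2)
```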

After removing duplicates, the index has gaps where rows were dropped. We can reset it to consecutive row numbers with the `reset_index()` method. Passing `drop=True` discards the old index instead of adding it back as a new column.

In [74]:
df = df.reset_index(drop=True)
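
The effect of the `drop` argument is easiest to see side by side. A small sketch, assuming a made-up DataFrame whose index has a gap after a row was removed:

```python
import pandas as pd

# Toy frame with a gap in the index (label 2 is missing), for illustration only
gappy = pd.DataFrame({'Company': ['Acme', 'Bolt', 'Cargo']}, index=[0, 1, 3])

# drop=True: the old index is discarded
with_drop = gappy.reset_index(drop=True)
print(with_drop.index.tolist())       # [0, 1, 2]

# drop=False (the default): the old index is kept as a column named 'index'
without_drop = gappy.reset_index()
print(without_drop.columns.tolist())  # ['index', 'Company']
```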

## **3. Data Sorting**

Some of the methods that we will use to sort include the following.

| Method | Description |
| --- | --- |
| `sort_values()` | Sort by the values along either axis. |
| `sort_index()` | Sort objects by labels (along an axis). |
| `nlargest()` | Return the first n rows ordered by columns in descending order. |
| `nsmallest()` | Return the first n rows ordered by columns in ascending order. |

We can use the `sort_values()` method to sort by any number of columns:

In [75]:
# sort by Years_To_Unicorn
# ascending=True --> from smallest to largest, A - Z
# ascending=False --> from largest to smallest, Z - A
df = df.sort_values('Years_To_Unicorn', ascending=True)
df

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class,Number Investors
983,Playco,$1B,2020-09-21,Other,Tokyo,Japan,Asia,2020,"Sozo Ventures, Caffeinated Capital, Sequoia Ca...",2020,0,1,Low,3
952,Mensa Brands,$1B,2021-11-16,Other,Bengaluru,India,Asia,2021,"Accel, Falcon Edge Capital, Norwest Venture Pa...",2021,0,1,Low,3
765,Jokr,$1B,2021-12-02,E-commerce & direct-to-consumer,New York,United States,North America,2021,"GGV Capital, Tiger Global Management, Greycroft",2021,0,1,Low,3
389,candy.com,$2B,2021-10-21,Fintech,New York,United States,North America,2021,"Insight Partners, Softbank Group, Connect Vent...",2021,0,2,Low,3
159,Ola Electric Mobility,$5B,2019-07-02,Auto & transportation,Bengaluru,India,Asia,2019,"SoftBank Group, Tiger Global Management, Matri...",2019,0,5,High,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
829,Radius Payment Solutions,$1B,2017-11-27,Fintech,Crewe,United Kingdom,Europe,1990,Inflexion Private Equity,2017,27,1,Low,1
1025,Thirty Madison,$1B,2021-06-02,Health,New York,United States,North America,1993,"Northzone Ventures, Maveron, Johnson & Johnson...",2021,28,1,Low,3
367,Promasidor Holdings,$2B,2016-11-08,Consumer & retail,Bryanston,South Africa,Asia,1979,"IFC, Ajinomoto",2016,37,2,Low,2
689,Five Star Business Finance,$1B,2021-03-26,Other,Chennai,India,Asia,1984,"Sequoia Capital India, Tiger Global Management...",2021,37,1,Low,3


In [76]:
df = df.sort_values(['Years_To_Unicorn', 'Valuation Number (in B)'], ascending=[True, True])
df

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class,Number Investors
983,Playco,$1B,2020-09-21,Other,Tokyo,Japan,Asia,2020,"Sozo Ventures, Caffeinated Capital, Sequoia Ca...",2020,0,1,Low,3
952,Mensa Brands,$1B,2021-11-16,Other,Bengaluru,India,Asia,2021,"Accel, Falcon Edge Capital, Norwest Venture Pa...",2021,0,1,Low,3
765,Jokr,$1B,2021-12-02,E-commerce & direct-to-consumer,New York,United States,North America,2021,"GGV Capital, Tiger Global Management, Greycroft",2021,0,1,Low,3
811,GlobalBees,$1B,2021-12-28,E-commerce & direct-to-consumer,New Delhi,India,Asia,2021,"Chiratae Ventures, SoftBank Group, Trifecta Ca...",2021,0,1,Low,3
389,candy.com,$2B,2021-10-21,Fintech,New York,United States,North America,2021,"Insight Partners, Softbank Group, Connect Vent...",2021,0,2,Low,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11,Epic Games,$32B,2018-10-26,Other,Cary,United States,North America,1991,"Tencent Holdings, KKR, Smash Ventures",2018,27,32,High,3
1025,Thirty Madison,$1B,2021-06-02,Health,New York,United States,North America,1993,"Northzone Ventures, Maveron, Johnson & Johnson...",2021,28,1,Low,3
689,Five Star Business Finance,$1B,2021-03-26,Other,Chennai,India,Asia,1984,"Sequoia Capital India, Tiger Global Management...",2021,37,1,Low,3
367,Promasidor Holdings,$2B,2016-11-08,Consumer & retail,Bryanston,South Africa,Asia,1979,"IFC, Ajinomoto",2016,37,2,Low,2
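
When sorting by several columns, the `ascending` list gives one direction per column, so the directions can differ. The sketch below uses made-up data with the same column names to sort by years ascending and, within ties, by valuation descending.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    'Company': ['A', 'B', 'C', 'D'],
    'Years_To_Unicorn': [5, 0, 5, 0],
    'Valuation Number (in B)': [10, 1, 20, 3],
})

# First key ascending, second key descending within equal values of the first
ranked = toy.sort_values(
    ['Years_To_Unicorn', 'Valuation Number (in B)'],
    ascending=[True, False],
)
print(ranked['Company'].tolist())  # ['D', 'B', 'C', 'A']
```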


To select the rows with the largest or smallest values, use `nlargest()` and `nsmallest()` instead. For example, here are the companies that took the longest to become a unicorn:

In [77]:
# display the 5 rows with the largest Years_To_Unicorn values
df.nlargest(5, 'Years_To_Unicorn')

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class,Number Investors
186,Otto Bock HealthCare,$4B,2017-06-24,Health,Duderstadt,Germany,Europe,1919,EQT Partners,2017,98,4,High,1
689,Five Star Business Finance,$1B,2021-03-26,Other,Chennai,India,Asia,1984,"Sequoia Capital India, Tiger Global Management...",2021,37,1,Low,3
367,Promasidor Holdings,$2B,2016-11-08,Consumer & retail,Bryanston,South Africa,Asia,1979,"IFC, Ajinomoto",2016,37,2,Low,2
1025,Thirty Madison,$1B,2021-06-02,Health,New York,United States,North America,1993,"Northzone Ventures, Maveron, Johnson & Johnson...",2021,28,1,Low,3
829,Radius Payment Solutions,$1B,2017-11-27,Fintech,Crewe,United Kingdom,Europe,1990,Inflexion Private Equity,2017,27,1,Low,1
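
As a quick check of how these two methods relate to `sort_values()`, here is a sketch on made-up data. When there are no ties, `nlargest(n, col)` gives the same rows as sorting in descending order and taking the first `n`.

```python
import pandas as pd

# Toy data, for illustration only
toy = pd.DataFrame({
    'Company': ['A', 'B', 'C', 'D'],
    'Years_To_Unicorn': [5, 0, 12, 7],
})

top2 = toy.nlargest(2, 'Years_To_Unicorn')      # rows for C (12) and D (7)
bottom2 = toy.nsmallest(2, 'Years_To_Unicorn')  # rows for B (0) and A (5)

# Equivalent to nlargest here because the values contain no ties
same_as_top2 = toy.sort_values('Years_To_Unicorn', ascending=False).head(2)
print(top2.equals(same_as_top2))  # True
```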


Because the earlier sorts left the rows out of index order, let's re-sort the rows by index:

In [78]:
df.sort_index(axis=0, inplace=True)
df.head()

Unnamed: 0,Company,Valuation,Date Joined,Industry,City,Country,Continent,Year Founded,Select Investors,Year Joined,Years_To_Unicorn,Valuation Number (in B),Valuation Class,Number Investors
0,Bytedance,$180B,2017-04-07,Artificial intelligence,Beijing,China,Asia,2012,"Sequoia Capital China, SIG Asia Investments, S...",2017,5,180,High,4
1,SpaceX,$100B,2012-12-01,Other,Hawthorne,United States,North America,2002,"Founders Fund, Draper Fisher Jurvetson, Rothen...",2012,10,100,High,3
2,SHEIN,$100B,2018-07-03,E-commerce & direct-to-consumer,Shenzhen,China,Asia,2008,"Tiger Global Management, Sequoia Capital China...",2018,10,100,High,3
3,Stripe,$95B,2014-01-23,Fintech,San Francisco,United States,North America,2010,"Khosla Ventures, LowercaseCapital, capitalG",2014,4,95,High,3
4,Klarna,$46B,2011-12-12,Fintech,Stockholm,Sweden,Europe,2005,"Institutional Venture Partners, Sequoia Capita...",2011,6,46,High,3


We can also sort the columns alphabetically:

In [79]:
df.sort_index(axis=1, inplace=True)
df.head()

Unnamed: 0,City,Company,Continent,Country,Date Joined,Industry,Number Investors,Select Investors,Valuation,Valuation Class,Valuation Number (in B),Year Founded,Year Joined,Years_To_Unicorn
0,Beijing,Bytedance,Asia,China,2017-04-07,Artificial intelligence,4,"Sequoia Capital China, SIG Asia Investments, S...",$180B,High,180,2012,2017,5
1,Hawthorne,SpaceX,North America,United States,2012-12-01,Other,3,"Founders Fund, Draper Fisher Jurvetson, Rothen...",$100B,High,100,2002,2012,10
2,Shenzhen,SHEIN,Asia,China,2018-07-03,E-commerce & direct-to-consumer,3,"Tiger Global Management, Sequoia Capital China...",$100B,High,100,2008,2018,10
3,San Francisco,Stripe,North America,United States,2014-01-23,Fintech,3,"Khosla Ventures, LowercaseCapital, capitalG",$95B,High,95,2010,2014,4
4,Stockholm,Klarna,Europe,Sweden,2011-12-12,Fintech,3,"Institutional Venture Partners, Sequoia Capita...",$46B,High,46,2005,2011,6
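
The two `axis` values of `sort_index()` can be compared on a small made-up DataFrame. This sketch returns new objects rather than using `inplace=True`, which leaves the original untouched.

```python
import pandas as pd

# Toy frame with shuffled row labels and column names, for illustration only
toy = pd.DataFrame({'b': [1, 2, 3], 'a': [4, 5, 6]}, index=[2, 0, 1])

by_rows = toy.sort_index(axis=0)  # orders the row labels: 0, 1, 2
by_cols = toy.sort_index(axis=1)  # orders the column names: 'a', 'b'

print(by_rows.index.tolist())    # [0, 1, 2]
print(by_cols.columns.tolist())  # ['a', 'b']
```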
