# Project: Pandas dataframe memory consumption optimization (from the Dataquest data engineer course)
## Introduction
In this project, assume that the memory available for analyzing a dataset is limited: say, only 20 MB of RAM can be used to store the dataframe.

The Crunchbase investments dataset (`crunchbase-investments.csv`) is given. The goal is to choose the correct data type for each column to reduce overall memory consumption. The actual contents of the dataset are not essential here; only the data types matter.

So, the task is stated as follows:  
* **Problem**: 20 MB memory limitation  
* **Goal**: Reduce dataframe memory consumption as much as possible  
* **Approach**:
    * Clean corrupted values
    * Format columns
    * Determine the appropriate data type for each column. Data types used: *category, float, int, datetime, bool*.
    * Process in chunks to fit within the memory limit

**Distinctions from "Loans_memory_optimization"**:  
* Checking for duplicates
* More attention to NaN values
* Parsing the month and quarter from dates
* Determining the source CSV encoding


## 1. Determining the CSV file encoding using the chardet library

In [1]:
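import chardet

# Let chardet guess the encoding from the first 2048 bytes of the file.
with open("crunchbase-investments.csv", mode="rb") as file:
    enc = chardet.detect(file.read(2048))
    print(enc)

# Opening in text mode without `encoding=` shows the platform default encoding.
with open("crunchbase-investments.csv") as file:
    print(file)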

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
<_io.TextIOWrapper name='crunchbase-investments.csv' mode='r' encoding='cp1251'>


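### chardet reports ASCII for the first 2048 bytes, while `open()` reports the platform default, cp1251. ASCII is a subset of cp1251, so both decode the sampled bytes identically and either can be used.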
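The compatibility of the two encodings can be checked on a small example. The sketch below uses a made-up byte string rather than the real file, so it only illustrates the general rule: bytes below 0x80 decode the same way under ASCII and cp1251, while higher bytes are valid only in cp1251.

```python
# Hypothetical sample: these header-like bytes are made up for illustration.
sample = b"company_name,funded_at\nAdverCar,2012-10-30\n"

# Every byte is below 0x80, so ASCII and cp1251 decode it identically.
as_ascii = sample.decode("ascii")
as_cp1251 = sample.decode("cp1251")
same_text = as_ascii == as_cp1251

# A byte above 0x7F is valid cp1251 but not valid ASCII.
non_ascii = b"\xc0"  # Cyrillic capital letter A in cp1251
try:
    non_ascii.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
decoded = non_ascii.decode("cp1251")

print(same_text, ascii_ok)
```

Since chardet sampled only the first 2048 bytes, choosing cp1251 is the safer of the two options: it also accepts any non-ASCII bytes that may appear later in the file.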

In [2]:
import pandas as pd
import numpy as np

CSV_FILE_PATH = "crunchbase-investments.csv"
CSV_ENCODING = "cp1251"

cbi_10 = pd.read_csv(CSV_FILE_PATH, nrows=10, encoding=CSV_ENCODING)

cbi_10

Unnamed: 0,company_permalink,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_permalink,investor_name,investor_category_code,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
0,/company/advercar,AdverCar,advertising,USA,CA,SF Bay,San Francisco,/company/1-800-flowers-com,1-800-FLOWERS.COM,,USA,NY,New York,New York,series-a,2012-10-30,2012-10,2012-Q4,2012,2000000
1,/company/launchgram,LaunchGram,news,USA,CA,SF Bay,Mountain View,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-01-23,2012-01,2012-Q1,2012,20000
2,/company/utap,uTaP,messaging,USA,,United States - Other,,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-01-01,2012-01,2012-Q1,2012,20000
3,/company/zoopshop,ZoopShop,software,USA,OH,Columbus,columbus,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,angel,2012-02-15,2012-02,2012-Q1,2012,20000
4,/company/efuneral,eFuneral,web,USA,OH,Cleveland,Cleveland,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2011-09-08,2011-09,2011-Q3,2011,20000
5,/company/tackk,Tackk,web,USA,OH,Cleveland,Cleveland,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-02-01,2012-02,2012-Q1,2012,20000
6,/company/acclaimd,Acclaimd,analytics,USA,OH,Columbus,Columbus,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,angel,2012-06-01,2012-06,2012-Q2,2012,20000
7,/company/acclaimd,Acclaimd,analytics,USA,OH,Columbus,Columbus,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,angel,2012-08-07,2012-08,2012-Q3,2012,70000
8,/company/toviefor,ToVieFor,ecommerce,USA,NY,New York,New York,/company/2010-nyu-stern-business-plan-competition,2010 NYU Stern Business Plan Competition,,,,unknown,,angel,2010-04-01,2010-04,2010-Q2,2010,75000
9,/company/ohk-labs,OHK Labs,sports,USA,FL,Palm Beach,Boca Raton,/company/22hundred-group,22Hundred Group,,,,unknown,,angel,2011-09-01,2011-09,2011-Q3,2011,100000


### Determine how many rows can be read within 5 MB of memory, so that the remaining 15 MB can be used to run the whole process.

In [3]:
cbi_1000 = pd.read_csv(CSV_FILE_PATH, nrows=1000, encoding=CSV_ENCODING)
cbi_1000.memory_usage(deep=True).sum() / 2**20

1.1115789413452148

### 1000 rows take about 1.1 MB, so 5000 rows (roughly 5.6 MB) will be read in each chunk
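The chunk size can be estimated from the sample measurement with simple arithmetic. The helper below is a hypothetical sketch (its name and the linear-growth assumption are not part of the original notebook); it uses the 1000-row figure printed above and the 5 MB budget stated in the text.

```python
def rows_for_budget(sample_mb: float, sample_rows: int, budget_mb: float) -> int:
    """Return how many rows fit into budget_mb, assuming memory grows linearly with rows."""
    mb_per_row = sample_mb / sample_rows
    return int(budget_mb / mb_per_row)


# Measured size of the 1000-row sample, in MB.
SAMPLE_MB = 1.1115789413452148

rows = rows_for_budget(sample_mb=SAMPLE_MB, sample_rows=1000, budget_mb=5)

# Estimated size of a 5000-row chunk under the same linear assumption.
chunk_mb = 5000 * SAMPLE_MB / 1000

print(rows, round(chunk_mb, 2))
```

Under this estimate a 5000-row chunk lands slightly above 5 MB, which still leaves most of the 20 MB limit free for intermediate objects.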

## 2. Check the number of NaN values and the memory usage of each column. Clean corrupted data.

In [4]:
import pandas as pd

cbi_chunks = pd.read_csv(CSV_FILE_PATH, chunksize=5000, encoding=CSV_ENCODING)

cols_nan_values_amnt_by_chunks = []
cols_mem_usage_by_chunks = []
n_rows = 0

for chunk in cbi_chunks:
    cols_nan_values_amnt_by_chunks.append(chunk.isnull().sum())
    cols_mem_usage_by_chunks.append(chunk.memory_usage(deep=True))
    n_rows += chunk.shape[0]

cols_nan_values_amnt_concat = pd.concat(cols_nan_values_amnt_by_chunks)
cols_nan_values_amnt = cols_nan_values_amnt_concat.groupby(
    cols_nan_values_amnt_concat.index
).sum()

cols_mem_usage_concat = pd.concat(cols_mem_usage_by_chunks)
cols_mem_usage = (
    cols_mem_usage_concat.groupby(cols_mem_usage_concat.index).sum() / 2**20
)
cols_mem_usage = cols_mem_usage.drop("Index")

total_mem_usage = cols_mem_usage.sum()
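The aggregation pattern used in this cell (concatenate the per-chunk Series, then group by index label and sum) can be verified on a tiny example. The CSV text below is made up and stands in for the real file; the check compares the chunked totals with the counts computed on the whole data at once.

```python
import io

import pandas as pd

# Made-up data standing in for the real CSV.
csv_text = "a,b\n1,x\n,y\n3,\n,z\n5,\n"

# One Series of NaN counts per chunk, indexed by column name.
per_chunk = [
    chunk.isnull().sum()
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]

# Stack the per-chunk Series, then sum the entries that share a column name.
stacked = pd.concat(per_chunk)
totals = stacked.groupby(stacked.index).sum()

# Reference: the same counts computed on the whole file in one go.
expected = pd.read_csv(io.StringIO(csv_text)).isnull().sum()

print(totals.to_dict(), expected.to_dict())
```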

In [5]:
cols_nan_values_amnt

company_category_code       643
company_city                533
company_country_code          1
company_name                  1
company_permalink             1
company_region                1
company_state_code          492
funded_at                     3
funded_month                  3
funded_quarter                3
funded_year                   3
funding_round_type            3
investor_category_code    50427
investor_city             12480
investor_country_code     12001
investor_name                 2
investor_permalink            2
investor_region               2
investor_state_code       16809
raised_amount_usd          3599
dtype: int64

In [6]:
cols_mem_usage

company_category_code     3.262619
company_city              3.343473
company_country_code      3.025223
company_name              3.425550
company_permalink         3.869808
company_region            3.253503
company_state_code        2.962161
funded_at                 3.378091
funded_month              3.226837
funded_quarter            3.226837
funded_year               0.403366
funding_round_type        3.252704
investor_category_code    0.593590
investor_city             2.752311
investor_country_code     2.524654
investor_name             3.735814
investor_permalink        4.749821
investor_region           3.239116
investor_state_code       2.361876
raised_amount_usd         0.403366
dtype: float64

In [7]:
total_mem_usage

56.99071979522705

### The columns "funded_month", "funded_year", "company_country_code" and several others contain only a few NaN values. Let's find out which rows contain them.

In [8]:
cbi_chunks = pd.read_csv(CSV_FILE_PATH, chunksize=5000, encoding=CSV_ENCODING)

filtered_chunks_l = []

for chunk in cbi_chunks:
    filtered_chunk = chunk[chunk["funded_year"].isna()]
    # Keep every chunk that has at least one matching row.
    if filtered_chunk.shape[0] > 0:
        filtered_chunks_l.append(filtered_chunk)

pd.concat(filtered_chunks_l)

Unnamed: 0,company_permalink,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_permalink,investor_name,investor_category_code,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
34224,/company/kotura,KOTURA,cleantech,USA,CA,Los Angeles,Monterey Park,/financial-organization/rockley-ventures,Rockley Ventures,,GBR,,Woodstock,,,,,,,
34225,57 Woodstock Road,,,,,,,,,,,,,,,,,,,
34226,,series-c+,2/7/08,2008-02,2008-Q1,2008,10000000,,,,,,,,,,,,,


### These rows are not really essential for the analysis, since most of their values are missing. So they can safely be removed from the dataframe.

### Now check for duplicate rows.

In [9]:
chunks = pd.read_csv(CSV_FILE_PATH, chunksize=5000, encoding=CSV_ENCODING)
n_duplicated = 0

for chunk in chunks:
    n_duplicated += chunk.duplicated().sum()

print(n_duplicated)

139
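Note that `chunk.duplicated()` only compares rows within the same chunk. If duplicates may also span chunk boundaries, one possible approach is to hash every row and remember the hashes seen so far. The sketch below is an assumption about how this could be done, not the notebook's method: the helper name is hypothetical and the CSV text is made up. It relies on `pd.util.hash_pandas_object`, and it assumes each column keeps the same dtype in every chunk, since the same value stored under different dtypes may hash differently.

```python
import io

import pandas as pd


def count_duplicates_across_chunks(chunks) -> int:
    """Count rows whose content was already seen in this or an earlier chunk."""
    seen = set()
    n_duplicated = 0
    for chunk in chunks:
        # One 64-bit hash per row, computed from the column values only.
        row_hashes = pd.util.hash_pandas_object(chunk, index=False)
        for row_hash in row_hashes:
            if row_hash in seen:
                n_duplicated += 1
            else:
                seen.add(row_hash)
    return n_duplicated


# Made-up data: rows (1, x) and (2, y) repeat, but never inside the same chunk.
csv_text = "a,b\n1,x\n2,y\n1,x\n3,z\n2,y\n"

per_chunk_dups = sum(
    int(chunk.duplicated().sum())
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
)
n_dup = count_duplicates_across_chunks(pd.read_csv(io.StringIO(csv_text), chunksize=2))

print(per_chunk_dups, n_dup)
```

The set of hashes costs a few bytes per row, so for a file of this size it stays far below the memory limit.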


### Create functions for reading the CSV and for removing duplicates and non-essential rows with NaN values from a chunk.

In [10]:
drop_cols = ["company_permalink", "investor_permalink", "investor_category_code"]
use_cols = cbi_10.columns.drop(drop_cols)


def get_chunks():
    return pd.read_csv(
        CSV_FILE_PATH, chunksize=5000, encoding=CSV_ENCODING, usecols=use_cols
    )


def clean_chunk(chunk):
    chunk.dropna(subset=["funded_year"], inplace=True)
    chunk.drop_duplicates(keep="first", inplace=True)

### Now check once more for NaN values and duplicated rows after applying the functions created above.

In [11]:
cbi_chunks = get_chunks()

cols_nan_values_amnt_by_chunks = []
n_duplicated = 0

for chunk in cbi_chunks:
    clean_chunk(chunk)
    cols_nan_values_amnt_by_chunks.append(chunk.isnull().sum())
    n_duplicated += chunk.duplicated().sum()

cols_nan_values_amnt_concat = pd.concat(cols_nan_values_amnt_by_chunks)
cols_nan_values_amnt = cols_nan_values_amnt_concat.groupby(
    cols_nan_values_amnt_concat.index
).sum()

print(cols_nan_values_amnt)
print(n_duplicated)

company_category_code      642
company_city               530
company_country_code         0
company_name                 0
company_region               0
company_state_code         491
funded_at                    0
funded_month                 0
funded_quarter               0
funded_year                  0
funding_round_type           0
investor_city            12421
investor_country_code    11943
investor_name                0
investor_region              0
investor_state_code      16735
raised_amount_usd         3585
dtype: int64
0


## 3. Defining suitable column data types.

### Start by checking whether the *string* and *numeric* columns behave consistently across chunks.

In [12]:
numeric = []
string = []

cbi_chunks = get_chunks()

for chunk in cbi_chunks:
    clean_chunk(chunk)
    numeric.append(chunk.select_dtypes(include=np.number).shape[1])
    string.append(chunk.select_dtypes(include="object").shape[1])

print(numeric)
print(string)

[2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5]
[15, 15, 15, 15, 15, 15, 15, 15, 15, 12, 12]


### Which columns are detected as numeric in the last chunks?

In [13]:
numeric_cols_by_chunks = []
string_cols_by_chunks = []

cbi_chunks = get_chunks()

for chunk in cbi_chunks:
    clean_chunk(chunk)
    numeric_cols_by_chunks.append(
        chunk.select_dtypes(include=np.number).columns.tolist()
    )
    string_cols_by_chunks.append(chunk.select_dtypes(include="object").columns.tolist())


for cols in numeric_cols_by_chunks:
    print(cols)

['funded_year', 'raised_amount_usd']
['funded_year', 'raised_amount_usd']
['funded_year', 'raised_amount_usd']
['funded_year', 'raised_amount_usd']
['funded_year', 'raised_amount_usd']
['funded_year', 'raised_amount_usd']
['funded_year', 'raised_amount_usd']
['funded_year', 'raised_amount_usd']
['funded_year', 'raised_amount_usd']
['investor_country_code', 'investor_state_code', 'investor_city', 'funded_year', 'raised_amount_usd']
['investor_country_code', 'investor_state_code', 'investor_city', 'funded_year', 'raised_amount_usd']
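The changing column lists come from per-chunk type inference: a column that is entirely empty within a chunk is read as float64 (all NaN), even if it holds strings elsewhere. The sketch below reproduces the effect on made-up data and shows one way to keep the dtype stable by declaring it in `read_csv`; the column names are illustrative only.

```python
import io

import pandas as pd

# Made-up data: "city" is filled in the first two rows and empty afterwards.
csv_text = "name,city\nA,Columbus\nB,Cleveland\nC,\nD,\n"

# Let pandas infer the dtype separately for every chunk.
inferred = [
    chunk["city"].dtype
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]

# Declare the dtype up front so that it stays the same in every chunk.
declared = [
    chunk["city"].dtype
    for chunk in pd.read_csv(
        io.StringIO(csv_text), chunksize=2, dtype={"city": "object"}
    )
]

print(inferred)
print(declared)
```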


### Are the "investor_country_code", "investor_state_code" and "investor_city" columns really numeric?

In [14]:
real_numeric_cols = numeric_cols_by_chunks[-1].copy()

chunks = get_chunks()

for chunk in chunks:
    clean_chunk(chunk)
    chunk = chunk[real_numeric_cols]

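    # Iterate over a copy, because columns are removed from the list inside the loop.
    for col in list(real_numeric_cols):
        col_values = chunk[col].dropna()
        col_n_rows = col_values.shape[0]
        n_numeric_values = pd.to_numeric(col_values, errors="coerce").notna().sum()

        # A column with at least one non-numeric value is not really numeric.
        if n_numeric_values != col_n_rows:
            real_numeric_cols.remove(col)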

print(real_numeric_cols)

['funded_year', 'raised_amount_usd']


### This is fairly obvious for columns that represent date parts, but let's make sure that "funded_year" can be converted to the *int* data type.

In [15]:
chunks = get_chunks()
is_int_convertable = True

for chunk in chunks:
    clean_chunk(chunk)
    col_values = chunk["funded_year"]
    if not np.array_equal(col_values, col_values.astype("int")):
        is_int_convertable = False

if is_int_convertable:
    print('"funded_year" can be converted to int')

"funded_year" can be converted to int


### Gather all columns and their data types in one dictionary for later use.

In [16]:
# Since the "funded_year" column contains no NaNs after cleaning, it can be converted to int
columns_data_types = {"int": ["funded_year"], "float": ["raised_amount_usd"]}
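The reason the absence of NaNs matters can be shown on a small made-up Series: a float column of whole numbers casts to int without loss and can then be downcast, whereas a column containing NaN cannot be cast to a plain integer dtype at all. This is only an illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up year values stored as floats, as pandas does when a column once held NaN.
years = pd.Series([2012.0, 2011.0, 2010.0])

# The cast is lossless when the float values equal their integer versions.
lossless = np.array_equal(years, years.astype("int"))

# Downcasting picks the smallest integer subtype that can hold the values.
small = pd.to_numeric(years.astype("int"), downcast="integer")

# A NaN has no integer representation, so the cast raises an error.
with_nan = pd.Series([2012.0, np.nan])
try:
    with_nan.astype("int")
    nan_cast_failed = False
except (ValueError, TypeError):
    nan_cast_failed = True

print(lossless, small.dtype, nan_cast_failed)
```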

In [17]:
string_cols_src = string_cols_by_chunks[0]

print(chunk[string_cols_src].dtypes)
chunk[string_cols_src]

company_name              object
company_category_code     object
company_country_code      object
company_state_code        object
company_region            object
company_city              object
investor_name             object
investor_country_code    float64
investor_state_code      float64
investor_region           object
investor_city            float64
funding_round_type        object
funded_at                 object
funded_month              object
funded_quarter            object
dtype: object


Unnamed: 0,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter
50000,NuORDER,fashion,USA,CA,Los Angeles,West Hollywood,Mortimer Singer,,,unknown,,series-a,2012-10-01,2012-10,2012-Q4
50001,ChaCha,advertising,USA,IN,Indianapolis,Carmel,Morton Meyerson,,,unknown,,series-b,2007-10-01,2007-10,2007-Q4
50002,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2008-04-18,2008-04,2008-Q2
50003,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2010-01-01,2010-01,2010-Q1
50004,Unified Color,software,USA,CA,SF Bay,South San Frnacisco,Mr. Andrew Oung,,,unknown,,angel,2010-01-01,2010-01,2010-Q1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52865,Garantia Data,enterprise,USA,CA,SF Bay,Santa Clara,Zohar Gilon,,,unknown,,series-a,2012-08-08,2012-08,2012-Q3
52866,DudaMobile,mobile,USA,CA,SF Bay,Palo Alto,Zohar Gilon,,,unknown,,series-c+,2013-04-08,2013-04,2013-Q2
52867,SiteBrains,software,USA,CA,SF Bay,San Francisco,zohar israel,,,unknown,,angel,2010-08-01,2010-08,2010-Q3
52868,Comprehend Systems,enterprise,USA,CA,SF Bay,Palo Alto,Zorba Lieberman,,,unknown,,series-a,2013-07-11,2013-07,2013-Q3


### From the chunk view, identify the datetime columns and add them to the dictionary.

In [18]:
columns_data_types["datetime"] = ["funded_at"]
# Important: "funded_month" and "funded_quarter" hold strings such as "2012-10" and "2012-Q4",
# so they must be rebuilt as numbers before they can be converted to int.
columns_data_types["int"].append("funded_month")
columns_data_types["int"].append("funded_quarter")
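A minimal sketch of how numeric month and quarter values can be derived from the date itself, using three dates taken from the preview table. It also compares the arithmetic formula for the quarter with the built-in `.dt.quarter` accessor; the variable names are illustrative.

```python
import pandas as pd

# Three "funded_at" values from the preview table.
funded_at = pd.to_datetime(
    pd.Series(["2012-10-30", "2012-01-23", "2011-09-08"]), format="%Y-%m-%d"
)

# Month number, 1 to 12.
month = funded_at.dt.month

# Quarter from the month number: months 1-3 give 1, 4-6 give 2, and so on.
quarter_formula = (month - 1) // 3 + 1

# The same value taken straight from the datetime accessor.
quarter_accessor = funded_at.dt.quarter

print(month.tolist(), quarter_formula.tolist(), quarter_accessor.tolist())
```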

### Now determine which columns can be converted to the *category* data type. As a first step, remove the columns whose data type is already defined from the string columns list.

In [19]:
cols_with_defined_types = [
    col for key in columns_data_types for col in columns_data_types[key]
]

print(cols_with_defined_types)

['funded_year', 'funded_month', 'funded_quarter', 'raised_amount_usd', 'funded_at']


In [20]:
string_cols = list(set(string_cols_src).difference(cols_with_defined_types))

string_cols

['company_state_code',
 'investor_city',
 'company_region',
 'company_name',
 'company_category_code',
 'investor_name',
 'investor_country_code',
 'investor_region',
 'investor_state_code',
 'company_city',
 'company_country_code',
 'funding_round_type']

### Calculate the uniqueness of values in the *string* columns. A column is treated as *category* if fewer than 50% of its values are unique.

In [21]:
chunks = get_chunks()

vc_by_chunks = {col: [] for col in string_cols}

for chunk in chunks:
    clean_chunk(chunk)
    chunk = chunk[string_cols]
    for col in string_cols:
        vc_by_chunks[col].append(chunk[col].value_counts())

columns_data_types["category"] = []

for col in vc_by_chunks:
    # Merge the per-chunk value counts to get the distinct values of the whole column.
    vc_concat = pd.concat(vc_by_chunks[col])
    unique_values_amount = len(vc_concat.groupby(vc_concat.index).sum().index)
    # Treat the column as categorical if fewer than half of its values are unique.
    if unique_values_amount / n_rows < 0.5:
        columns_data_types["category"].append(col)
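The uniqueness threshold reflects how the *category* type stores data: each distinct value is kept once, plus a small integer code per row. The sketch below, on made-up data, compares a column with few distinct values against one where every value is unique. The helper name is hypothetical and the exact ratios depend on the pandas version.

```python
import pandas as pd

n = 10_000

# Few distinct values, repeated many times.
low_card = pd.Series(["series-a", "series-b", "angel", "other"] * (n // 4), dtype="object")

# Every value is unique.
high_card = pd.Series([f"company-{i}" for i in range(n)], dtype="object")


def category_ratio(col: pd.Series) -> float:
    """Memory of the category version divided by memory of the original column."""
    before = col.memory_usage(deep=True, index=False)
    after = col.astype("category").memory_usage(deep=True, index=False)
    return after / before


low_ratio = category_ratio(low_card)
high_ratio = category_ratio(high_card)

print(round(low_ratio, 3), round(high_ratio, 3))
```

The repeated-values column shrinks to a small fraction of its original size, while the all-unique column gains little or nothing from the conversion.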

## 4. Converting columns to defined data types.

### Keep in mind that "funded_at" must be converted to datetime first in order to derive the *int* representation of funded_month and funded_quarter, so that these columns can then be converted to *int*.

In [22]:
import json
print(json.dumps(columns_data_types, indent=2))

{
  "int": [
    "funded_year",
    "funded_month",
    "funded_quarter"
  ],
  "float": [
    "raised_amount_usd"
  ],
  "datetime": [
    "funded_at"
  ],
  "category": [
    "company_state_code",
    "investor_city",
    "company_region",
    "company_name",
    "company_category_code",
    "investor_name",
    "investor_country_code",
    "investor_region",
    "investor_state_code",
    "company_city",
    "company_country_code",
    "funding_round_type"
  ]
}
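As an alternative to converting after reading, a mapping like this one could be handed to `read_csv` directly, so that columns get their compact types while they are parsed. The sketch below is an assumption, not what the notebook does: the CSV rows are made up, `np_names` is a hypothetical lookup, and `int16` and `float32` are illustrative target types. An integer dtype in `read_csv` fails on missing values, so this only works for columns without NaNs, and string columns such as "funded_month" would still need to be rebuilt from the date.

```python
import io

import pandas as pd

# Made-up rows with a subset of the real column names.
csv_text = (
    "funding_round_type,funded_at,funded_year,raised_amount_usd\n"
    "series-a,2012-10-30,2012,2000000\n"
    "angel,2012-02-15,2012,20000\n"
    "angel,2010-04-01,2010,\n"
)

# Hypothetical mapping in the same shape as columns_data_types.
types = {
    "category": ["funding_round_type"],
    "int": ["funded_year"],
    "float": ["raised_amount_usd"],
    "datetime": ["funded_at"],
}

# Assumed concrete dtypes for each kind; int16 and float32 are illustrative choices.
np_names = {"category": "category", "int": "int16", "float": "float32"}

# Flatten the mapping into the {column: dtype} form that read_csv expects.
dtype_map = {col: np_names[kind] for kind in np_names for col in types[kind]}

df = pd.read_csv(
    io.StringIO(csv_text), dtype=dtype_map, parse_dates=types["datetime"]
)

print(df.dtypes)
```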


In [23]:
chunks = get_chunks()
mem_usage_optimized = 0

for chunk in chunks:
    clean_chunk(chunk)

    # Parse the date first, then rebuild month and quarter as numbers.
    chunk["funded_at"] = pd.to_datetime(chunk["funded_at"], format="%Y-%m-%d")
    chunk["funded_month"] = chunk["funded_at"].dt.month
    chunk["funded_quarter"] = (chunk["funded_month"] - 1) // 3 + 1

    # Cast to int, then downcast to the smallest integer subtype that fits.
    int_cols = columns_data_types["int"]
    chunk[int_cols] = chunk[int_cols].astype("int")
    chunk[int_cols] = chunk[int_cols].apply(pd.to_numeric, downcast="integer")

    # Same for floats: cast, then downcast where possible.
    float_cols = columns_data_types["float"]
    chunk[float_cols] = chunk[float_cols].astype("float")
    chunk[float_cols] = chunk[float_cols].apply(pd.to_numeric, downcast="float")

    # Low-uniqueness string columns become categories.
    cat_cols = columns_data_types["category"]
    chunk[cat_cols] = chunk[cat_cols].astype("category")

    # Accumulate the memory footprint of the optimized chunk, in bytes.
    mem_usage_optimized += chunk.memory_usage(deep=True).sum()

mem_usage_optimized /= 2**20 

print(mem_usage_optimized)

7.136919975280762
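One caveat when the optimized chunks are later combined into a single dataframe: each chunk builds its own set of categories, and concatenating categorical columns whose categories differ does not keep the categorical dtype. The sketch below uses two made-up chunks and shows `union_categoricals` from `pandas.api.types` as one possible way to merge them; it is not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two made-up chunks whose category sets differ.
chunk_a = pd.Series(["angel", "series-a"], dtype="category")
chunk_b = pd.Series(["angel", "other"], dtype="category")

# Plain concatenation drops the categorical dtype when the categories differ.
plain = pd.concat([chunk_a, chunk_b], ignore_index=True)

# union_categoricals merges the category sets and keeps the dtype.
merged = pd.Series(union_categoricals([chunk_a, chunk_b]))

print(plain.dtype, merged.dtype)
```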


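## By dropping unneeded columns and converting the rest to appropriate data types, memory consumption was reduced from about 57 MB to about 7.1 MB, roughly **8 times** less (a reduction of about 87%). So now the whole dataset fits within the 20 MB limit!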