# Data Exploration

## Reading clean data

In [None]:
import pandas as pd
df = pd.read_csv("data/clean.csv")
df.head(3)

## Year-wise industry count

**Question for Discussion:**

* How do you count the number of industries per year? 
* How should we plot that?
* What does that tell us?

### Count industries using groupby

Let's see the count for each year.

In [None]:
df.groupby("register_year").size()
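To make the counting step concrete, here is a minimal sketch on made-up rows. Only the `register_year` column comes from the notebook; the `industry_name` column and all values are assumptions for illustration.

```python
import pandas as pd

# Toy data standing in for data/clean.csv (values are made up)
toy = pd.DataFrame({
    "register_year": [2069, 2069, 2070, 2071, 2071, 2071],
    "industry_name": ["A", "B", "C", "D", "E", "F"],  # hypothetical column
})

# size() counts the rows in each group, so each value is the
# number of industries registered in that year
counts = toy.groupby("register_year").size()
print(counts)
```

The result is a Series indexed by year, which is why it can be passed straight to `.plot()` in the next step.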

### Bar graph using plot function

Let's prepare a bar graph with default parameters.

In [None]:
df.groupby("register_year").size().plot(kind="??")

### Improved bar graph with increased figsize

The chart is hard to read. Let's increase the figure size.

In [None]:
df.groupby("register_year").size().plot(kind="??", figsize=(??))

The chart shows a spike in industry registrations in 2049, which suggests that something happened that year. This needs further research.
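As one possible way to fill in the two plotting cells above, the sketch below draws a bar chart of yearly counts on made-up data. The `kind="bar"` and `figsize=(12, 5)` values mirror the stacked chart used later in this notebook; the axis labels are an addition, not part of the original cells.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running outside a notebook; not needed in Jupyter
import pandas as pd

# Toy data (made up) with the same column name as the notebook
toy = pd.DataFrame({"register_year": [2069, 2069, 2070, 2071, 2071, 2071]})
counts = toy.groupby("register_year").size()

# kind="bar" draws one vertical bar per year;
# figsize is (width, height) in inches
ax = counts.plot(kind="bar", figsize=(12, 5))
ax.set_xlabel("register_year")
ax.set_ylabel("number of industries")
```

A wider figure gives each year's tick label more room, which is what makes the real chart readable.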

## FDI trend

Let's see how many of these industries have received FDI.

**Question for Discussion**

* How do we add FDI data to the above trend?

In [None]:
df["has_fdi"] = ??
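One way to derive the flag, sketched on made-up rows: this assumes an industry counts as having FDI whenever its `foreign_percent` (the column used later in this notebook) is above zero. That rule is an assumption, not something the notebook states.

```python
import pandas as pd

# Toy rows (made up); foreign_percent is the foreign share of investment
toy = pd.DataFrame({
    "register_year": [2069, 2069, 2070, 2071],
    "foreign_percent": [0, 100, 40, 0],
})

# Assumption: any foreign share above zero means the industry has FDI.
# A missing share compares as False, so such rows count as non-FDI.
toy["has_fdi"] = toy["foreign_percent"] > 0
print(toy)
```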

In [None]:
df.groupby(["register_year", "has_fdi"]).size().unstack().plot(kind='bar', stacked=True, figsize=(12,5))
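To see what the stacked bar chart is drawn from, the intermediate table can be inspected directly. A minimal sketch on made-up data; `fill_value=0` is an addition here so that years missing one category show 0 instead of NaN.

```python
import pandas as pd

# Toy rows (made up) with the two grouping columns used above
toy = pd.DataFrame({
    "register_year": [2068, 2069, 2069, 2070, 2070],
    "has_fdi": [False, False, True, True, True],
})

# Grouping on two keys gives a Series indexed by (year, has_fdi);
# unstack() moves has_fdi into the columns, leaving one row per year
table = toy.groupby(["register_year", "has_fdi"]).size().unstack(fill_value=0)
print(table)
```

Each row of this table becomes one bar, and each column becomes one coloured segment of that bar.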

According to this data, FDI has flowed into Nepal only since 2069. The validity of the data needs further research, but for now we will trust it.

Let's focus on the last 5 years only, since that is the only period in which the data shows FDI.

## Convergence in Analysis

Now we will create a new dataframe that narrows our analysis to the last 5 years of data.

In [None]:
df_5years = df.query("??")

In [None]:
df_5years.head(3)
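The narrowing step could look like the following sketch on made-up rows. The cutoff of 2069 is an assumption taken from the observation that FDI appears in the data from 2069 onwards; adjust it to whatever the real data shows.

```python
import pandas as pd

# Toy rows (made up)
toy = pd.DataFrame({
    "register_year": [2049, 2068, 2069, 2071, 2073],
    "has_fdi": [False, False, True, False, True],
})

# Assumption: 2069 is the first year of the 5-year window.
# query() keeps only the rows for which the expression is true.
recent = toy.query("register_year >= 2069")
print(recent)
```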

## FDI vs Non-FDI analysis

In [None]:
df_5years.plot(kind = "scatter", x = "working_capital", y = "employment", alpha=0.5)

## Scatter plot with different scales for FDI

Let's show the different industry scales in the scatter plot.

In [None]:
df_foreign = df_5years[??]

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot()
ax.scatter(df_foreign.query("scale == 'SMALL'").working_capital, df_foreign.query("scale == 'SMALL'").employment, alpha=0.2, color="blue")
ax.scatter(df_foreign.query("scale == 'MEDIUM'").working_capital, df_foreign.query("scale == 'MEDIUM'").employment, alpha=0.6, color="orange")
ax.scatter(df_foreign.query("scale == 'LARGE'").working_capital, df_foreign.query("scale == 'LARGE'").employment, alpha=0.6, color="green")

plt.show()

Everything is concentrated in the bottom-left corner. Let's spread the data out using log scales.

In [None]:
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(111)
ax.scatter(df_foreign.query("scale == 'SMALL'").working_capital, df_foreign.query("scale == 'SMALL'").employment, alpha=0.2, color="blue", label="small")
ax.scatter(df_foreign.query("scale == 'MEDIUM'").working_capital, df_foreign.query("scale == 'MEDIUM'").employment, alpha=0.6, color="orange", label="medium")
ax.scatter(df_foreign.query("scale == 'LARGE'").working_capital, df_foreign.query("scale == 'LARGE'").employment, alpha=0.6, color="green", label="large")
ax.set_ylim([0,1000])
ax.set_yscale('symlog')
ax.set_xscale('log')
ax.legend()

plt.show()

## Scatter plot with different scales for 100% local investment

**Exercise:**

* Conduct the above analysis for 100% local investment.

In [None]:
df_local = df_5years[df_5years.has_fdi==False]

fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(111)
ax.scatter(df_local.query("scale == 'SMALL'").working_capital, df_local.query("scale == 'SMALL'").employment, alpha=0.2, color="blue", label="small")
ax.scatter(df_local.query("scale == 'MEDIUM'").working_capital, df_local.query("scale == 'MEDIUM'").employment, alpha=0.6, color="orange", label="medium")
ax.scatter(df_local.query("scale == 'LARGE'").working_capital, df_local.query("scale == 'LARGE'").employment, alpha=0.6, color="green", label="large")
ax.set_ylim([0,1000])
ax.set_yscale('symlog')
ax.set_xscale('log')
ax.legend()

plt.show()

## Scatter plot side by side

Let's put the above charts side by side.

In [None]:
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

ax1.scatter(df_local.query("scale == 'SMALL'").working_capital, df_local.query("scale == 'SMALL'").employment, alpha=0.2, color="blue", label="small")
ax1.scatter(df_local.query("scale == 'MEDIUM'").working_capital, df_local.query("scale == 'MEDIUM'").employment, alpha=0.6, color="orange", label="medium")
ax1.scatter(df_local.query("scale == 'LARGE'").working_capital, df_local.query("scale == 'LARGE'").employment, alpha=0.6, color="green", label="large")
ax1.set_ylim([0,1000])
ax1.set_yscale('symlog')
ax1.set_xscale('log')
ax1.set_title("Non-FDI")
ax1.legend()

ax2.scatter(df_foreign.query("scale == 'SMALL'").working_capital, df_foreign.query("scale == 'SMALL'").employment, alpha=0.2, color="blue", label="small")
ax2.scatter(df_foreign.query("scale == 'MEDIUM'").working_capital, df_foreign.query("scale == 'MEDIUM'").employment, alpha=0.6, color="orange", label="medium")
ax2.scatter(df_foreign.query("scale == 'LARGE'").working_capital, df_foreign.query("scale == 'LARGE'").employment, alpha=0.6, color="green", label="large")
ax2.set_ylim([0,1000])
ax2.set_yscale('symlog')
ax2.set_xscale('log')
ax2.set_title("FDI")
ax2.legend()

plt.show()
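The side-by-side cell repeats the same nine lines for each panel. One way to cut the repetition is a small helper, sketched below on made-up data. `plot_by_scale` and `STYLES` are hypothetical names, not part of the notebook; the colours, alpha values, and axis settings are copied from the cells above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# (scale value, colour, alpha) triples copied from the cells above
STYLES = [("SMALL", "blue", 0.2), ("MEDIUM", "orange", 0.6), ("LARGE", "green", 0.6)]

def plot_by_scale(ax, data, title):
    """Hypothetical helper: draw one scatter series per industry scale on ax."""
    for scale, color, alpha in STYLES:
        subset = data[data["scale"] == scale]
        ax.scatter(subset["working_capital"], subset["employment"],
                   alpha=alpha, color=color, label=scale.lower())
    ax.set_ylim([0, 1000])
    ax.set_yscale("symlog")  # symlog keeps rows with zero employment visible
    ax.set_xscale("log")
    ax.set_title(title)
    ax.legend()

# Toy rows (made up) standing in for df_local and df_foreign
toy = pd.DataFrame({
    "scale": ["SMALL", "MEDIUM", "LARGE", "SMALL"],
    "working_capital": [1e5, 1e7, 1e9, 5e5],
    "employment": [10, 80, 600, 0],
    "has_fdi": [False, True, True, False],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
plot_by_scale(ax1, toy[~toy["has_fdi"]], "Non-FDI")
plot_by_scale(ax2, toy[toy["has_fdi"]], "FDI")
```

Because both panels go through the same function, they are guaranteed to share colours and axis scales, which makes the comparison fairer.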

**Question for Discussion**

* What can you infer from above scatterplots?

We see that FDI flows more into small-scale industries than into large-scale industries.

## Binning foreign investment into different ranges

Let's see how the foreign investment share varies: how many industries are 100% foreign-funded and how many are less than 100%.

In [None]:
bins = [0,10,49,99,100]
labels = ["<=10%","11 to 49%","50-less than 100%","100%"]
# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
df_foreign = df_foreign.copy()
df_foreign["fdi_range"] = pd.cut(df_foreign.foreign_percent, bins, labels=labels)

In [None]:
df_foreign.groupby("fdi_range").size()
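To check how the bin edges map shares to labels, here is a small sketch with made-up values placed on and around the edges. It uses the same `bins` and `labels` as above and relies on the `pd.cut` default of right-closed intervals.

```python
import pandas as pd

bins = [0, 10, 49, 99, 100]
labels = ["<=10%", "11 to 49%", "50-less than 100%", "100%"]

# Made-up shares chosen to sit on or next to the bin edges
shares = pd.Series([0, 5, 10, 11, 49, 50, 99, 100])

# Default intervals are (0, 10], (10, 49], (49, 99], (99, 100]
binned = pd.cut(shares, bins, labels=labels)
print(pd.DataFrame({"foreign_percent": shares, "fdi_range": binned}))
```

Two things are worth noticing. A share of exactly 0 falls outside the first interval and becomes NaN, which does not matter as long as `df_foreign` only holds industries with a positive foreign share. A fractional share such as 99.5 would land in the "100%" bin, so the labels are exact only for whole-number percentages.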

We see that the majority of investments are 100% foreign. The small investments are interesting too. Let's identify them.

In [None]:
df_foreign[df_foreign.foreign_percent<=10]

In [None]:
small_scale = df_foreign.query("foreign_percent == 100 and scale == 'SMALL'")["employment"]
medium_scale = df_foreign.query("foreign_percent == 100 and scale == 'MEDIUM'")["employment"]
large_scale = df_foreign.query("foreign_percent == 100 and scale == 'LARGE'")["employment"]

fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.boxplot([small_scale, medium_scale, large_scale], 
           labels = ["Small", "Medium", "Large"], meanline = True, showmeans = True)
plt.show()
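The boxplot above compares employment across scales for fully foreign-funded industries. The numbers behind such a plot can also be read off with `describe()`, as in this sketch on made-up rows (the values are invented; the column names follow the notebook).

```python
import pandas as pd

# Toy rows (made up)
toy = pd.DataFrame({
    "scale": ["SMALL", "SMALL", "SMALL", "LARGE", "LARGE"],
    "foreign_percent": [100, 100, 100, 100, 60],
    "employment": [10, 20, 30, 400, 900],
})

# Keep only industries that are 100% foreign-funded, as in the boxplot cell
fully_foreign = toy[toy["foreign_percent"] == 100]

# describe() reports count, mean, and quartiles per scale;
# the "50%" column is the median drawn as the line inside each box
summary = fully_foreign.groupby("scale")["employment"].describe()
print(summary[["count", "mean", "50%"]])
```

Checking the group counts this way is useful because a box built from only a handful of industries can look more conclusive than it is.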