## Artificial Dataset Generator: Testing Different Datasets
### Dataset Specification:
* The Dataset contains 1200 rows (observations) and 10 columns (variables)
* Detailed description of the variables:


| Variable Name     | Description                            | Format         | Min Value     | Max Value  | Missing | Notes |
|-------------------|----------------------------------------|----------------|---------------|------------|---------|-------|
| Date              | Month of observation                   | Date Type      | 1/1/2016      | 12/1/2016  | 0%      | Only the first day of each month |
| CompanyID         | Company's ID in the system             | Integer        | 0             | 100        | 0%      |       |
| Revenue           | Company's monthly revenue              | Float, Decimal | 0             | 999,999.99 | 1%      | The missing share is relative to the number of observations |
| Expenses          | Company's monthly expenses             | Float, Decimal | 0             | 500,000.00 | 1%      | The missing share is relative to the number of observations |
| Profit            | Company's monthly profit               | Float, Decimal |               |            | 1%      | Profit = Revenue - Expenses. The missing share is relative to the number of observations |
| LossFlag          | Monthly balance check                  | Binary         | 0             | 1          | 1%      | 0 = the company made a profit, 1 = the company has a negative balance. The missing share is relative to the number of observations |
| Employees         | Number of employees in the company     | Integer        | 10            | 1000       | 0%      | Each company keeps the same number of employees for the whole year |
| Region            | Company's geographical location        | Categorical    | A, B, C, D, E |            | 0%      | Region shares: A=25%, B=20%, C=10%, D=5%, E=40% |
| BusinessValuation | Company's market value                 | Float, Decimal |               |            | 0%      | Changes monthly and equals 3% to 10% of the company's profit |
| ClosedFlag        | The company is no longer in the market | Binary         | 0             | 1          |         | 0 = No, 1 = Yes. Share of 1s: 10% of the companies that have a negative balance for more than 3 months, and 0.5% of those with a negative balance for 2 or fewer months. |
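
The specification above can be turned into a small generator. The following is only a minimal sketch under stated assumptions, not the code that produced any of the datasets below: the function name `make_dataset`, the seed, the uniform distributions, the use of 100 companies with IDs 0 to 99 (so that 100 companies x 12 months gives 1200 rows), and the per-company Region assignment are all assumptions. `ClosedFlag` is left out to keep the sketch short, so the result has 9 of the 10 columns.

```python
import numpy as np
import pandas as pd

# Region shares taken from the specification table
REGION_PROBS = {"A": 0.25, "B": 0.20, "C": 0.10, "D": 0.05, "E": 0.40}


def make_dataset(n_companies=100, missing_rate=0.01, seed=0):
    """Hypothetical generator: one row per company per month of 2016."""
    rng = np.random.default_rng(seed)
    # "MS" = month start, i.e. the first day of each month
    months = pd.date_range("2016-01-01", periods=12, freq="MS")

    # Per-company attributes that stay constant over the year
    companies = pd.DataFrame({
        "CompanyID": np.arange(n_companies),
        "Employees": rng.integers(10, 1001, size=n_companies),  # 10..1000 inclusive
        "Region": rng.choice(list(REGION_PROBS), size=n_companies,
                             p=list(REGION_PROBS.values())),
    })

    # Cross join companies with months -> n_companies * 12 rows
    df = companies.merge(pd.DataFrame({"Date": months}), how="cross")
    n = len(df)

    # Uniform draws are an assumption; the specification only fixes the ranges
    df["Revenue"] = rng.uniform(0, 999_999.99, size=n).round(2)
    df["Expenses"] = rng.uniform(0, 500_000.00, size=n).round(2)
    df["Profit"] = (df["Revenue"] - df["Expenses"]).round(2)
    df["LossFlag"] = (df["Profit"] < 0).astype(float)
    # Valuation is 3% to 10% of profit, computed before any values are blanked out
    df["BusinessValuation"] = (df["Profit"] * rng.uniform(0.03, 0.10, size=n)).round(2)

    # Blank out roughly 1% of observations, independently per column
    for col in ["Revenue", "Expenses", "Profit", "LossFlag"]:
        df.loc[rng.random(n) < missing_rate, col] = np.nan

    return df[["Date", "CompanyID", "Revenue", "Expenses", "Profit", "LossFlag",
               "Employees", "Region", "BusinessValuation"]]


df = make_dataset()
print(df.shape)  # (1200, 9)
```

Because missing values are drawn at random, the observed missing share in each column is only approximately 1%.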


In [1]:
import pandas as pd
import numpy as np
import random
import itertools
# The import below is only needed if 'display' is not available in your notebook
# from IPython.display import display

### Alex's dataset

In [2]:
# load dataset from .csv
final_df_Alex = pd.read_csv('financial_artificial_dataset_AK.csv')

In [3]:
# convert CloseFlag to int so that describe() includes it
# (this file names the column CloseFlag, not ClosedFlag as in the specification)
final_df_Alex['CloseFlag'] = final_df_Alex['CloseFlag'].astype('int')
display(final_df_Alex.dtypes)
display(final_df_Alex.describe())
print('\nNaN STATISTICS')
display(final_df_Alex.isna().sum())
q_str = 'CompanyID in [26, 51, 44, 58, 69]'
display(final_df_Alex.query(q_str).sort_values(['CompanyID', 'Date']))

Date                  object
CompanyID              int64
Employees              int64
Revenue              float64
Expenses             float64
Profit               float64
LossFlag             float64
Region                object
BusinessValuation    float64
CloseFlag              int64
dtype: object

|       | CompanyID | Employees | Revenue | Expenses | Profit | LossFlag | BusinessValuation | CloseFlag |
|-------|-----------|-----------|---------|----------|--------|----------|-------------------|-----------|
| count | 1200.0 | 1200.0 | 1155.0 | 1154.0 | 1154.0 | 1154.0 | 1166.0 | 1200.0 |
| mean | 49.5 | 208.27 | 425238.138442 | 223853.518925 | 201207.182062 | 0.405546 | 13114.752487 | 0.028333 |
| std | 28.878105 | 243.010718 | 333278.700364 | 134884.471611 | 296717.011844 | 0.49121 | 20650.726418 | 0.165993 |
| min | 0.0 | 10.0 | 620.69 | 1709.31 | -454431.65 | 0.0 | -44686.9 | 0.0 |
| 25% | 24.75 | 47.75 | 90465.085 | 90465.21 | -34573.6025 | 0.0 | -2132.36 | 0.0 |
| 50% | 49.5 | 89.5 | 343566.64 | 231687.63 | 174968.47 | 0.0 | 9317.785 | 0.0 |
| 75% | 74.25 | 286.75 | 744273.185 | 328642.075 | 475166.6525 | 1.0 | 28616.585 | 0.0 |
| max | 99.0 | 989.0 | 999956.03 | 499873.51 | 940167.1 | 1.0 | 78344.68 | 1.0 |



NaN STATISTICS


Date                  0
CompanyID             0
Employees             0
Revenue              45
Expenses             46
Profit               46
LossFlag             46
Region                0
BusinessValuation    34
CloseFlag             0
dtype: int64

|   | Date | CompanyID | Employees | Revenue | Expenses | Profit | LossFlag | Region | BusinessValuation | CloseFlag |
|---|------|-----------|-----------|---------|----------|--------|----------|--------|-------------------|-----------|
| 0 | 2016-01-01 | 26 | 54 | 929805.23 | 274406.75 | 655398.48 | 0.0 | A | 44840.36 | 0 |
| 1 | 2016-02-01 | 26 | 54 | 945682.01 | 316877.41 | 628804.6 | 0.0 | A | 50344.14 | 0 |
| 2 | 2016-03-01 | 26 | 54 | 809434.4 | 252932.49 | 556501.91 | 0.0 | A | 40175.79 | 0 |
| 3 | 2016-04-01 | 26 | 54 | 924571.51 | 307911.34 | 616660.17 | 0.0 | A | 42020.35 | 0 |
| 4 | 2016-05-01 | 26 | 54 | 676168.26 | 354345.84 | 321822.42 | 0.0 | A | 19198.59 | 0 |
| 5 | 2016-06-01 | 26 | 54 | 806109.63 | 366210.98 | 439898.65 | 0.0 | A | 33085.92 | 0 |
| 6 | 2016-07-01 | 26 | 54 | 978358.01 | 371738.1 | 606619.92 | 0.0 | A | 36780.04 | 0 |
| 7 | 2016-08-01 | 26 | 54 | 849098.38 | 350149.41 | 498948.98 | 0.0 | A | 46114.92 | 0 |
| 8 | 2016-09-01 | 26 | 54 | 705606.59 | 321700.66 | 383905.93 | 0.0 | A | 37414.09 | 0 |
| 9 | 2016-10-01 | 26 | 54 | 595487.07 | 305533.43 | 289953.65 | 0.0 | A | 16481.23 | 0 |
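
The NaN counts above are easier to judge as shares: 45 missing Revenue values out of 1200 rows is 3.75%, against the 1% target in the specification. A minimal sketch of such a comparison follows; the helper name `missing_report` and the tolerance `tol` are hypothetical, and a toy frame stands in for the loaded CSV.

```python
import numpy as np
import pandas as pd

# Target missing shares from the specification table
SPEC_MISSING = {"Revenue": 0.01, "Expenses": 0.01, "Profit": 0.01,
                "LossFlag": 0.01, "BusinessValuation": 0.0}


def missing_report(df, spec=SPEC_MISSING, tol=0.005):
    """Hypothetical helper: observed vs. specified share of missing values."""
    observed = df[list(spec)].isna().mean()
    report = pd.DataFrame({"observed": observed, "expected": pd.Series(spec)})
    # Flag columns whose observed share is within `tol` of the target
    report["within_tol"] = (report["observed"] - report["expected"]).abs() <= tol
    return report


# Toy frame standing in for a loaded dataset: 200 rows
toy = pd.DataFrame({
    "Revenue": [np.nan, np.nan] + [1.0] * 198,   # 1% missing
    "Expenses": [1.0] * 200,                     # nothing missing
    "Profit": [np.nan] * 20 + [0.0] * 180,       # 10% missing
    "LossFlag": [0.0] * 200,
    "BusinessValuation": [0.0] * 200,
})
report = missing_report(toy)
print(report)
```

On the real data the call would be `missing_report(final_df_Alex)`; the same helper works for the other datasets because it only uses column names shared by all three.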


In [4]:
# Checking closed flag
display(final_df_Alex[final_df_Alex.CloseFlag == 1.].sort_values(['CompanyID', 'Date']))

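|      | Date | CompanyID | Employees | Revenue | Expenses | Profit | LossFlag | Region | BusinessValuation | CloseFlag |
|------|------|-----------|-----------|---------|----------|--------|----------|--------|-------------------|-----------|
| 820 | 2016-05-01 | 14 | 158 | NaN | NaN | NaN | NaN | D | NaN | 1 |
| 821 | 2016-06-01 | 14 | 158 | NaN | NaN | NaN | NaN | D | NaN | 1 |
| 822 | 2016-07-01 | 14 | 158 | NaN | NaN | NaN | NaN | D | NaN | 1 |
| 823 | 2016-08-01 | 14 | 158 | NaN | NaN | NaN | NaN | D | NaN | 1 |
| 824 | 2016-09-01 | 14 | 158 | NaN | NaN | NaN | NaN | D | NaN | 1 |
| 825 | 2016-10-01 | 14 | 158 | NaN | NaN | NaN | NaN | D | NaN | 1 |
| 826 | 2016-11-01 | 14 | 158 | NaN | NaN | NaN | NaN | D | NaN | 1 |
| 827 | 2016-12-01 | 14 | 158 | NaN | NaN | NaN | NaN | D | NaN | 1 |
| 1038 | 2016-07-01 | 58 | 228 | NaN | NaN | NaN | NaN | E | NaN | 1 |
| 1039 | 2016-08-01 | 58 | 228 | NaN | NaN | NaN | NaN | E | NaN | 1 |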
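
The specification ties `ClosedFlag` to the number of months with a negative balance, so one way to check a dataset is to count loss months per company and look at the share of closed companies in each group. The sketch below is one possible reading of that rule, not a definitive check: it assumes "months with negative balance" means the total number of months with `LossFlag == 1` (not consecutive months), and the function name `closure_rate_by_loss_group` is hypothetical. Months with a missing `LossFlag`, such as the closed months shown above, are not counted.

```python
import pandas as pd


def closure_rate_by_loss_group(df, closed_col="ClosedFlag"):
    """Hypothetical check: share of closed companies per loss-month group."""
    per_company = df.groupby("CompanyID").agg(
        loss_months=("LossFlag", "sum"),   # NaN flags are skipped by sum
        closed=(closed_col, "max"),        # 1 if the company was ever flagged closed
    )
    # Groups follow the specification wording; exactly 3 loss months is not covered by it
    per_company["group"] = pd.cut(per_company["loss_months"],
                                  bins=[-1, 2, 3, 12],
                                  labels=["<=2", "3", ">3"])
    return per_company.groupby("group", observed=False).agg(
        companies=("closed", "count"),
        closed_share=("closed", "mean"),
    )


# Toy data: (CompanyID, number of loss months, closed or not)
rows = []
for cid, n_loss, closed in [(0, 5, 1), (1, 4, 0), (2, 1, 0), (3, 3, 0)]:
    for m in range(12):
        rows.append({"CompanyID": cid,
                     "LossFlag": 1.0 if m < n_loss else 0.0,
                     "ClosedFlag": closed if m >= 6 else 0})
toy = pd.DataFrame(rows)

rates = closure_rate_by_loss_group(toy)
print(rates)
```

For the dataset loaded above, the column is named `CloseFlag`, so the call would be `closure_rate_by_loss_group(final_df_Alex, closed_col="CloseFlag")`.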


### Jonathan's dataset

In [5]:
# load dataset from .csv
final_df_Jonathan = pd.read_csv('SDS_Challenge1_Jonathan/Profit_and_Loss_Challenge_Dataset.csv')

In [6]:
display(final_df_Jonathan.dtypes)
display(final_df_Jonathan.describe())
print('\nNaN STATISTICS')
display(final_df_Jonathan.isna().sum())
q_str = 'CompanyID in [27, 24, 44, 58, 69]'
display(final_df_Jonathan.query(q_str).sort_values(['CompanyID', 'Date']))

Date                  object
CompanyID              int64
Revenue              float64
Expenses             float64
Profit               float64
LossFlag             float64
Employees              int64
Region                object
BusinessValuation    float64
ClosedFlag           float64
dtype: object

|       | CompanyID | Revenue | Expenses | Profit | LossFlag | Employees | BusinessValuation | ClosedFlag |
|-------|-----------|---------|----------|--------|----------|-----------|-------------------|------------|
| count | 1200.0 | 1001.0 | 1000.0 | 999.0 | 1001.0 | 1200.0 | 1011.0 | 1200.0 |
| mean | 49.5 | 171323.883896 | 104811.65568 | 68496.215215 | 0.390609 | 462.79 | 70075.638971 | 0.1575 |
| std | 28.878105 | 281115.854729 | 158117.994031 | 201721.54846 | 0.488131 | 273.296631 | 154138.762628 | 0.364423 |
| min | 0.0 | 10.17 | 1.15 | -886634.43 | 0.0 | 12.0 | -58696.0 | 0.0 |
| 25% | 24.75 | 556.23 | 509.0975 | -577.095 | 0.0 | 195.25 | -13.5 | 0.0 |
| 50% | 49.5 | 8749.96 | 8499.05 | 109.7 | 0.0 | 471.5 | 141.0 | 0.0 |
| 75% | 74.25 | 224124.91 | 177288.1325 | 33263.74 | 1.0 | 659.0 | 56213.5 | 0.0 |
| max | 99.0 | 999405.64 | 987898.38 | 823915.37 | 1.0 | 998.0 | 937606.0 | 1.0 |



NaN STATISTICS


Date                   0
CompanyID              0
Revenue              199
Expenses             200
Profit               201
LossFlag             199
Employees              0
Region                 0
BusinessValuation    189
ClosedFlag             0
dtype: int64

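|     | Date | CompanyID | Revenue | Expenses | Profit | LossFlag | Employees | Region | BusinessValuation | ClosedFlag |
|-----|------|-----------|---------|----------|--------|----------|-----------|--------|-------------------|------------|
| 24 | 2016-01-31 | 24 | 53.65 | 46.79 | 6.86 | 0.0 | 85 | D | 1.0 | 0.0 |
| 124 | 2016-02-29 | 24 | 93.34 | 32.18 | 61.16 | 0.0 | 85 | D | 7.0 | 0.0 |
| 224 | 2016-03-31 | 24 | 34.99 | 47.61 | -12.62 | 1.0 | 85 | D | 6.0 | 0.0 |
| 324 | 2016-04-30 | 24 | 94.56 | 72.13 | 22.43 | 0.0 | 85 | D | 8.0 | 0.0 |
| 424 | 2016-05-31 | 24 | 83.65 | NaN | 25.67 | 0.0 | 85 | D | 10.0 | 0.0 |
| 524 | 2016-06-30 | 24 | 98.13 | 8.83 | 89.3 | 0.0 | 85 | D | 19.0 | 0.0 |
| 624 | 2016-07-31 | 24 | 79.68 | 81.71 | -2.03 | 1.0 | 85 | D | 21.0 | 0.0 |
| 724 | 2016-08-31 | 24 | 44.21 | 74.98 | -30.77 | 1.0 | 85 | D | 26.0 | 0.0 |
| 824 | 2016-09-30 | 24 | NaN | NaN | NaN | NaN | 85 | D | NaN | 1.0 |
| 924 | 2016-10-31 | 24 | NaN | NaN | NaN | NaN | 85 | D | NaN | 1.0 |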
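
Row 424 above has a Profit value but no Expenses value, so missing values appear to be applied per column, after Profit was computed. Any check of `Profit = Revenue - Expenses` therefore has to skip rows where one of the three values is missing. A minimal sketch, with a hypothetical helper name `check_profit_and_lossflag` and a small tolerance `atol` (an assumption, to absorb rounding to two decimals):

```python
import numpy as np
import pandas as pd


def check_profit_and_lossflag(df, atol=0.01):
    """Hypothetical check of Profit = Revenue - Expenses and LossFlag = (Profit < 0)."""
    # Only rows where all three amounts are present can be compared
    full = df.dropna(subset=["Revenue", "Expenses", "Profit"])
    profit_ok = np.isclose(full["Profit"], full["Revenue"] - full["Expenses"], atol=atol)

    # LossFlag only needs Profit to be present
    flagged = df.dropna(subset=["Profit", "LossFlag"])
    flag_ok = flagged["LossFlag"].eq((flagged["Profit"] < 0).astype(float))

    return {
        "profit_rows_checked": len(full),
        "profit_mismatches": int((~profit_ok).sum()),
        "flag_rows_checked": len(flagged),
        "flag_mismatches": int((~flag_ok).sum()),
    }


# Four rows copied from the output above (CompanyID 24), including row 424
sample = pd.DataFrame({
    "Revenue": [53.65, 93.34, 34.99, 83.65],
    "Expenses": [46.79, 32.18, 47.61, np.nan],
    "Profit": [6.86, 61.16, -12.62, 25.67],
    "LossFlag": [0.0, 0.0, 1.0, 0.0],
})
result = check_profit_and_lossflag(sample)
print(result)
# {'profit_rows_checked': 3, 'profit_mismatches': 0, 'flag_rows_checked': 4, 'flag_mismatches': 0}
```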


In [7]:
# Checking closed flag
display(final_df_Jonathan[final_df_Jonathan.ClosedFlag == 1.].sort_values(['CompanyID', 'Date']))

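|      | Date | CompanyID | Revenue | Expenses | Profit | LossFlag | Employees | Region | BusinessValuation | ClosedFlag |
|------|------|-----------|---------|----------|--------|----------|-----------|--------|-------------------|------------|
| 703 | 2016-08-31 | 3 | NaN | NaN | NaN | NaN | 771 | B | NaN | 1.0 |
| 803 | 2016-09-30 | 3 | NaN | NaN | NaN | NaN | 771 | B | NaN | 1.0 |
| 903 | 2016-10-31 | 3 | NaN | NaN | NaN | NaN | 771 | B | NaN | 1.0 |
| 1003 | 2016-11-30 | 3 | NaN | NaN | NaN | NaN | 771 | B | NaN | 1.0 |
| 1103 | 2016-12-31 | 3 | NaN | NaN | NaN | NaN | 771 | B | NaN | 1.0 |
| 909 | 2016-10-31 | 9 | NaN | NaN | NaN | NaN | 511 | E | NaN | 1.0 |
| 1009 | 2016-11-30 | 9 | NaN | NaN | NaN | NaN | 511 | E | NaN | 1.0 |
| 1109 | 2016-12-31 | 9 | NaN | NaN | NaN | NaN | 511 | E | NaN | 1.0 |
| 615 | 2016-07-31 | 15 | NaN | NaN | NaN | NaN | 502 | B | NaN | 1.0 |
| 715 | 2016-08-31 | 15 | NaN | NaN | NaN | NaN | 502 | B | NaN | 1.0 |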


### Kostas's dataset

In [8]:
# load dataset from .csv
final_df_Kostas = pd.read_csv('SDS_Challenge1_Kostas/ArtificialDataset_Kostas.csv')

In [9]:
display(final_df_Kostas.dtypes)
display(final_df_Kostas.describe())
print('\nNaN STATISTICS')
display(final_df_Kostas.isna().sum())
# CompanyID in this dataset runs from 10001 to 10100, so the value 69 matches no rows
q_str = 'CompanyID in [10003, 10051, 10048, 10058, 69]'
display(final_df_Kostas.query(q_str).sort_values(['CompanyID', 'Date']))

Date                  object
CompanyID              int64
Revenue              float64
Expenses             float64
Profit               float64
LossFlag             float64
Employees              int64
Region                object
BusinessValuation    float64
ClosedFlag             int64
dtype: object

|       | CompanyID | Revenue | Expenses | Profit | LossFlag | Employees | BusinessValuation | ClosedFlag |
|-------|-----------|---------|----------|--------|----------|-----------|-------------------|------------|
| count | 1200.0 | 1188.0 | 1186.0 | 1186.0 | 1188.0 | 1200.0 | 1200.0 | 1200.0 |
| mean | 10050.5 | 508061.0425 | 254536.657209 | 253928.404005 | 0.250842 | 53.45 | 16238.437983 | 0.04 |
| std | 28.878105 | 287652.995762 | 145610.374722 | 317992.817868 | 0.43368 | 28.2438 | 22070.384017 | 0.196041 |
| min | 10001.0 | 525.09 | 863.45 | -452175.03 | 0.0 | 11.0 | -40385.25 | 0.0 |
| 25% | 10025.75 | 259395.5575 | 130017.43 | 4939.9675 | 0.0 | 25.5 | 98.57 | 0.0 |
| 50% | 10050.5 | 507714.425 | 259312.99 | 256093.26 | 0.0 | 55.0 | 14738.095 | 0.0 |
| 75% | 10075.25 | 759357.145 | 381523.95 | 498739.3975 | 1.0 | 78.0 | 29780.445 | 0.0 |
| max | 10100.0 | 999428.74 | 499749.54 | 986409.56 | 1.0 | 100.0 | 94774.63 | 1.0 |



NaN STATISTICS


Date                  0
CompanyID             0
Revenue              12
Expenses             14
Profit               14
LossFlag             12
Employees             0
Region                0
BusinessValuation     0
ClosedFlag            0
dtype: int64

|    | Date | CompanyID | Revenue | Expenses | Profit | LossFlag | Employees | Region | BusinessValuation | ClosedFlag |
|----|------|-----------|---------|----------|--------|----------|-----------|--------|-------------------|------------|
| 24 | 01/01/2016 | 10003 | 921729.69 | 327986.99 | 593742.7 | 0.0 | 30 | A | 56136.28 | 0 |
| 25 | 01/02/2016 | 10003 | 191115.44 | 359192.77 | -168077.33 | 1.0 | 30 | A | -8534.32 | 0 |
| 26 | 01/03/2016 | 10003 | 630893.35 | 148237.84 | 482655.51 | 0.0 | 30 | A | 46943.67 | 0 |
| 27 | 01/04/2016 | 10003 | 213723.49 | 464595.23 | -250871.74 | 1.0 | 30 | A | -11606.15 | 0 |
| 28 | 01/05/2016 | 10003 | 480110.56 | 95787.27 | 384323.29 | 0.0 | 30 | A | 14124.7 | 0 |
| 29 | 01/06/2016 | 10003 | 153774.76 | 490957.24 | -337182.48 | 1.0 | 30 | A | -16510.47 | 0 |
| 30 | 01/07/2016 | 10003 | 430075.41 | 466998.42 | -36923.01 | 1.0 | 30 | A | -1764.69 | 0 |
| 31 | 01/08/2016 | 10003 | 49414.5 | 336558.26 | -287143.76 | 1.0 | 30 | A | -18890.31 | 0 |
| 32 | 01/09/2016 | 10003 | 994640.34 | 67338.99 | 927301.35 | 0.0 | 30 | A | 75625.04 | 0 |
| 33 | 01/10/2016 | 10003 | 802608.7 | 249069.86 | 553538.84 | 0.0 | 30 | A | 25136.17 | 0 |
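
The three files store `Date` differently: the first uses ISO dates on the first of the month, the second uses month-end dates, and the rows above look like day-first strings (01/02/2016 following 01/01/2016 on consecutive monthly rows). All three load as `object`, so the "first day of each month" rule from the specification has to be checked after parsing. A minimal sketch, with a hypothetical helper name `first_of_month_share`; the short series below imitate the three styles rather than loading the files:

```python
import pandas as pd


def first_of_month_share(dates, dayfirst=False):
    """Hypothetical check: share of dates that fall on the first day of a month."""
    parsed = pd.to_datetime(dates, dayfirst=dayfirst)
    return float((parsed.dt.day == 1).mean())


iso_first = pd.Series(["2016-01-01", "2016-02-01", "2016-03-01"])      # first-dataset style
iso_month_end = pd.Series(["2016-01-31", "2016-02-29", "2016-03-31"])  # second-dataset style
day_first = pd.Series(["01/01/2016", "01/02/2016", "01/03/2016"])      # third-dataset style

share_first = first_of_month_share(iso_first)
share_month_end = first_of_month_share(iso_month_end)
share_day_first = first_of_month_share(day_first, dayfirst=True)
print(share_first, share_month_end, share_day_first)  # 1.0 0.0 1.0
```

Passing `dayfirst=True` matters for the third style: without it, `01/02/2016` would be read as 2 January rather than 1 February.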


In [10]:
# Checking closed flag
display(final_df_Kostas[final_df_Kostas.ClosedFlag == 1].sort_values(['CompanyID', 'Date']))

|     | Date | CompanyID | Revenue | Expenses | Profit | LossFlag | Employees | Region | BusinessValuation | ClosedFlag |
|-----|------|-----------|---------|----------|--------|----------|-----------|--------|-------------------|------------|
| 180 | 01/01/2016 | 10016 | 16891.46 | 136848.42 | -119956.96 | 1.0 | 80 | A | -8818.71 | 1 |
| 181 | 01/02/2016 | 10016 | 48908.48 | 320638.83 | -271730.35 | 1.0 | 80 | A | -25331.91 | 1 |
| 182 | 01/03/2016 | 10016 | 329879.13 | 161070.94 | 168808.19 | 0.0 | 80 | A | 10287.25 | 1 |
| 183 | 01/04/2016 | 10016 | 799321.99 | 239355.0 | 559966.99 | 0.0 | 80 | A | 45729.01 | 1 |
| 184 | 01/05/2016 | 10016 | 448185.59 | 396747.99 | 51437.6 | 0.0 | 80 | A | 4498.13 | 1 |
| 185 | 01/06/2016 | 10016 | 965253.12 | 11687.08 | 953566.04 | 0.0 | 80 | A | 55830.98 | 1 |
| 186 | 01/07/2016 | 10016 | 186737.7 | 226062.7 | -39325.0 | 1.0 | 80 | A | -2963.75 | 1 |
| 187 | 01/08/2016 | 10016 | 773611.85 | 201049.34 | 572562.51 | 0.0 | 80 | A | 24497.88 | 1 |
| 188 | 01/09/2016 | 10016 | 120609.44 | 70977.13 | 49632.31 | 0.0 | 80 | A | 2807.97 | 1 |
| 189 | 01/10/2016 | 10016 | 180986.42 | 73930.2 | 107056.22 | 0.0 | 80 | A | 4786.69 | 1 |
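
The closed rows differ between the datasets: in the first two, rows with the flag set have no financial values, while in the rows above the flag is set from January onward and the financial values are still present. A minimal sketch of a per-company summary that makes this difference visible follows. The helper name `closed_rows_summary` and the toy company IDs are hypothetical.

```python
import numpy as np
import pandas as pd

FINANCIAL_COLS = ["Revenue", "Expenses", "Profit", "LossFlag", "BusinessValuation"]


def closed_rows_summary(df, closed_col="ClosedFlag"):
    """Hypothetical summary of rows flagged as closed, per company."""
    closed = df[df[closed_col] == 1]
    # True where a closed row still carries at least one financial value
    has_data = closed[FINANCIAL_COLS].notna().any(axis=1)
    return (closed.assign(has_financial_data=has_data)
                  .groupby("CompanyID")
                  .agg(closed_rows=(closed_col, "count"),
                       rows_with_data=("has_financial_data", "sum")))


# Toy data: company 1 has blank closed rows, company 2 keeps values while flagged closed
toy = pd.DataFrame({
    "CompanyID": [1, 1, 1, 2, 2],
    "Revenue": [100.0, np.nan, np.nan, 200.0, 300.0],
    "Expenses": [50.0, np.nan, np.nan, 250.0, 100.0],
    "Profit": [50.0, np.nan, np.nan, -50.0, 200.0],
    "LossFlag": [0.0, np.nan, np.nan, 1.0, 0.0],
    "BusinessValuation": [3.0, np.nan, np.nan, -2.5, 12.0],
    "ClosedFlag": [0, 1, 1, 1, 1],
})
summary = closed_rows_summary(toy)
print(summary)
```

A non-zero `rows_with_data` marks companies whose closed months still report revenue or profit, which is the pattern shown in the table above.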
