# Welcome to my notebook! I tried to write a beginner-friendly, detailed walkthrough of the competition's complex file structure.

![image.png](attachment:2896ab32-cf43-42d6-ad01-fe54bf650f13.png)

**The purpose of this notebook is to analyze the competition files. Unlike the data in many other competitions, financial data can be complex.**

**In my next two notebooks, I will visualise the data and submit my model predictions.**

**Let's start with our dataset. The main directory contains 5 folders and one csv file; let's try to understand each of them before we delve deeper.**

![image.png](attachment:225483e8-dd27-49d4-b0d6-d35e8619c297.png)
* data_specifications: Definitions for individual columns.
* example_test_files: Data folder covering the public test period. Intended to facilitate offline testing. Includes the same columns delivered by the API (i.e., no Target column). You can calculate the Target column from the Close column; it's the return from buying a stock the next day and selling the day after that. This folder also includes an example of the sample submission file that will be delivered by the API.
* jpx_tokyo_market_prediction: Files that enable the API. Expect the API to deliver all rows in under five minutes and to reserve less than 0.5 GB of memory.
* supplemental_files: Data folder containing a dynamic window of supplemental training data. This will be updated with new data during the main phase of the competition in early May, early June, and roughly a week before the submissions are locked.
* train_files: Data folder covering the main training period.
* stock_list.csv: Mapping between the SecuritiesCode and company names, plus general information about which industry the company is in.
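
The Target definition in the example_test_files bullet can be sketched in pandas. This is only a minimal sketch on made-up prices. It assumes the return is (Close two days ahead minus Close one day ahead) divided by Close one day ahead, and that the frame has `Date`, `SecuritiesCode` and `Close` columns. The helper name `add_target` is hypothetical.

```python
import pandas as pd

def add_target(prices: pd.DataFrame) -> pd.DataFrame:
    """Add a Target column: buy at the next day's Close, sell at the Close one day later."""
    out = prices.sort_values(["SecuritiesCode", "Date"]).copy()
    close_by_code = out.groupby("SecuritiesCode")["Close"]
    buy = close_by_code.shift(-1)   # Close on day t+1 (assumed buy price)
    sell = close_by_code.shift(-2)  # Close on day t+2 (assumed sell price)
    out["Target"] = (sell - buy) / buy
    return out

# Made-up prices for a single security
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07"]),
    "SecuritiesCode": [1301] * 4,
    "Close": [100.0, 110.0, 121.0, 108.9],
})
result = add_target(toy)
print(result[["Date", "Close", "Target"]])
```

The last two rows of each security get no Target, because the two future closing prices they need are not in the data.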

**stock_list.csv**
When we check the csv file, we can observe 16 different columns.
1. SecuritiesCode: A unique code for each different security.
2. EffectiveDate: Presumably the date as of which the row's information is valid (not the legal "effective date" of a contract).
3. Name: Name of the security.
4. Section/Products: Type of products in strings.
5. New Market Segment: Denotes different market types (growth market or standard market etc.)
6. 33SectorCode: Code of the security's sector in the 33-sector classification.
7. 33SectorName: Name of the security's sector in the 33-sector classification.
8. 17SectorCode: Code of the security's sector in the coarser 17-sector classification.
9. 17SectorName: Name of the security's sector in the 17-sector classification.
10. New Index Series Size Code: I do not have enough information; see stock_list_spec.csv for its definition.
11. New Index Series Size: I do not have enough information; see stock_list_spec.csv for its definition.
12. TradeDate: Specific dates for trades.
13. Close: close price for a security.
14. Issued Shares: The amount of total issued shares.
15. Market Capitalization: the value of a company that is traded on the stock market, calculated by multiplying the total number of shares by the present share price.
16. Universe0: Not clear to me; see stock_list_spec.csv for its definition.
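
The market capitalization formula from item 15 (shares times price) can be checked with a few lines of pandas. This is a sketch on made-up numbers, not on the real file. The column names `IssuedShares` and `MarketCapitalization` are assumed spellings and may differ from the actual csv header.

```python
import pandas as pd

# Made-up rows; the real values live in stock_list.csv
toy_list = pd.DataFrame({
    "SecuritiesCode": [1301, 1332],
    "Close": [3000.0, 500.0],
    "IssuedShares": [10_000_000, 300_000_000],
    "MarketCapitalization": [30_000_000_000.0, 150_000_000_000.0],
})

# Market capitalization = closing price x number of issued shares
recomputed = toy_list["Close"] * toy_list["IssuedShares"]

# Flag rows where the stored value agrees with the recomputed one
toy_list["CapMatches"] = (recomputed - toy_list["MarketCapitalization"]).abs() < 1.0
print(toy_list[["SecuritiesCode", "CapMatches"]])
```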

**Now, let's check our data**

In [1]:
import pandas as pd
tmp_df = pd.read_csv('../input/jpx-tokyo-stock-exchange-prediction/stock_list.csv')

print(f'stock_list.csv consists of {tmp_df.shape[0]} rows and {tmp_df.shape[1]} columns.')

stock_list.csv consists of 4417 rows and 16 columns.


**Now we are going to check the first folder, data_specifications.**
There are 5 different csv files:
* options_spec.csv: specification of the columns in options.csv; for background on options, you can [check this overview](https://www.investopedia.com/terms/o/option.asp).
* stock_fin_spec.csv: specification of the columns in financials.csv. "stock_fin" here appears to mean stock financials, not stock financing.
* stock_list_spec.csv: specification of the columns in stock_list.csv.
* stock_price_spec.csv: specification of the columns in the stock price files.
* trades_spec.csv: specification of the columns in trades.csv.

**Our next target is options_spec.csv:**

options_spec.csv contains 5 columns:

Column: The name of the column being described (Close, High, Date, etc.).

Sample Value: Example of different values for particular columns.

Type: Different datatypes for different columns.(string, int, date, float etc.)

Addendum: Additional notes attached to a column's description (here it is just a field of the spec file, not a contract addendum).

Remarks: Annotations for columns, you can check the remarks to understand different columns, parameters.

**stock_fin_spec.csv, stock_list_spec.csv and stock_price_spec.csv contain the same columns as options_spec.csv. trades_spec.csv is the exception: the output below shows it has 6 columns instead of 5.**
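
One way to put a spec file to use is to check whether every column of a data file is documented in it. The sketch below uses small made-up frames instead of the real files. It assumes the spec keeps column names in a field called `Column`, as described above. The helper name `undocumented_columns` is hypothetical.

```python
import pandas as pd

def undocumented_columns(spec: pd.DataFrame, data: pd.DataFrame) -> list:
    """Return the data columns that have no matching entry in the spec's Column field."""
    documented = set(spec["Column"].astype(str).str.strip())
    return [col for col in data.columns if col not in documented]

# Made-up spec and data, standing in for a *_spec.csv file and its data file
toy_spec = pd.DataFrame({
    "Column": ["Date", "SecuritiesCode", "Close"],
    "Type": ["date", "Int64", "float"],
})
toy_data = pd.DataFrame({
    "Date": ["2022-01-04"],
    "SecuritiesCode": [1301],
    "Close": [100.0],
    "Volume": [5000],
})

missing = undocumented_columns(toy_spec, toy_data)
print(missing)  # ['Volume']
```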

In [2]:
import os
for filename in os.listdir('../input/jpx-tokyo-stock-exchange-prediction/data_specifications'):
    f = os.path.join('../input/jpx-tokyo-stock-exchange-prediction/data_specifications', filename)
    # checking if it is a file
    if os.path.isfile(f):
        tmp_df = pd.read_csv(f)
        filename = f.split('/')[-1]
        print(f'{filename} consists of {tmp_df.shape[0]} rows and {tmp_df.shape[1]} columns.')

stock_fin_spec.csv consists of 45 rows and 5 columns.
trades_spec.csv consists of 56 rows and 6 columns.
stock_price_spec.csv consists of 12 rows and 5 columns.
options_spec.csv consists of 31 rows and 5 columns.
stock_list_spec.csv consists of 16 rows and 5 columns.
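
Since the same loop is useful for every folder, it can be wrapped in a small helper. This is a sketch, not part of the original cells: the function name `summarize_csv_folder` is hypothetical, and the demonstration runs on a temporary folder with made-up data so that it works outside Kaggle.

```python
from pathlib import Path
import tempfile

import pandas as pd

def summarize_csv_folder(folder):
    """Return {file name: (rows, columns)} for every csv file directly inside folder."""
    shapes = {}
    for path in sorted(Path(folder).glob("*.csv")):
        shapes[path.name] = pd.read_csv(path).shape
    return shapes

# Demonstration on a temporary folder holding one made-up csv file
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv(Path(tmp) / "demo.csv", index=False)
    shapes = summarize_csv_folder(tmp)

for name, (rows, cols) in shapes.items():
    print(f"{name} consists of {rows} rows and {cols} columns.")
```

On Kaggle, the same function could be pointed at any of the dataset folders, for example `'../input/jpx-tokyo-stock-exchange-prediction/data_specifications'`.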


*I am not going to repeat myself, so I am not going to keep mentioning the common columns.*

The next folder we are going to check is example_test_files. There are 6 different csv files:

financials.csv: financial information about securities, with dates and disclosure IDs.

options.csv: options data; its columns are described in options_spec.csv.

sample_submission.csv: a sample submission file to guide Kaggle competitors.

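secondary_stock_prices.csv: a second stock price file with the same number of columns as stock_prices.csv. It appears to cover securities outside the main set, but I do not fully understand the logic behind 'secondary'.

stock_prices.csv: as its name suggests, price data for the stocks.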

trades.csv: columns for trades.

In [3]:
for filename in os.listdir('../input/jpx-tokyo-stock-exchange-prediction/example_test_files'):
    f = os.path.join('../input/jpx-tokyo-stock-exchange-prediction/example_test_files', filename)
    # checking if it is a file
    if os.path.isfile(f):
        tmp_df = pd.read_csv(f)
        filename = f.split('/')[-1]
        print(f'{filename} consists of {tmp_df.shape[0]} rows and {tmp_df.shape[1]} columns.')

sample_submission.csv consists of 112000 rows and 3 columns.
options.csv consists of 9076 rows and 31 columns.
financials.csv consists of 14 rows and 45 columns.
secondary_stock_prices.csv consists of 4178 rows and 11 columns.
trades.csv consists of 2 rows and 56 columns.
stock_prices.csv consists of 4000 rows and 11 columns.


**Let's check jpx_tokyo_market_prediction folder:**

There are two files:

__init__.py: for the competition host's submission API.

competition.cpython-37m-x86_64-linux-gnu.so: judging by its name, this is a compiled Python extension module (a Linux shared object built for CPython 3.7 on x86_64). It looks cool and is probably the compiled part of the submission API.
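
The file name itself tells us a lot, because compiled CPython extension modules follow a naming pattern of module name, interpreter and ABI tag, platform, and suffix. The sketch below simply splits the name into those parts. The helper `parse_extension_filename` is hypothetical and assumes the name has exactly this dot-and-dash layout.

```python
def parse_extension_filename(filename):
    """Split a CPython extension-module file name into its parts."""
    module, tag, suffix = filename.split(".")
    implementation, abi, platform = tag.split("-", 2)
    return {
        "module": module,                  # importable module name
        "implementation": implementation,  # interpreter the module was built for
        "abi": abi,                        # ABI tag, e.g. 37m for CPython 3.7
        "platform": platform,              # target architecture and OS
        "suffix": suffix,                  # so = shared object on Linux
    }

info = parse_extension_filename("competition.cpython-37m-x86_64-linux-gnu.so")
for key, value in info.items():
    print(f"{key}: {value}")
```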

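**The supplemental_files folder is helpful because it gives us supplemental training data that is updated during the competition.**

financials.csv, options.csv, secondary_stock_prices.csv, stock_prices.csv and trades.csv have the same names as in example_test_files. Note that the two stock price files have 12 columns here instead of 11; the extra column is presumably Target, which the example test files omit.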

In [4]:
for filename in os.listdir('../input/jpx-tokyo-stock-exchange-prediction/supplemental_files'):
    f = os.path.join('../input/jpx-tokyo-stock-exchange-prediction/supplemental_files', filename)
    # checking if it is a file
    if os.path.isfile(f):
        tmp_df = pd.read_csv(f)
        filename = f.split('/')[-1]
        print(f'{filename} consists of {tmp_df.shape[0]} rows and {tmp_df.shape[1]} columns.')

options.csv consists of 500408 rows and 31 columns.
financials.csv consists of 9434 rows and 45 columns.
secondary_stock_prices.csv consists of 242704 rows and 12 columns.
trades.csv consists of 165 rows and 56 columns.
stock_prices.csv consists of 229958 rows and 12 columns.
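
The stock price files have 12 columns here but 11 in example_test_files. A quick way to find out which column differs is to read only the header row of each file and compare. The sketch below uses in-memory csv text with made-up, shortened headers instead of the real files. The helper name `header_of` is hypothetical.

```python
import io

import pandas as pd

def header_of(csv_source):
    """Read only the header row of a csv source and return its column names."""
    return list(pd.read_csv(csv_source, nrows=0).columns)

# Made-up stand-ins for the two stock_prices.csv files
example_csv = io.StringIO(
    "Date,SecuritiesCode,High,Close\n2022-01-04,1301,101,100\n"
)
supplemental_csv = io.StringIO(
    "Date,SecuritiesCode,High,Close,Target\n2022-01-04,1301,101,100,0.01\n"
)

example_header = header_of(example_csv)
supplemental_header = header_of(supplemental_csv)

# Columns present in the supplemental file but not in the example test file
extra = [col for col in supplemental_header if col not in example_header]
print(extra)  # ['Target']
```

With the real files, `header_of` can be given the two file paths instead of the in-memory text.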


Now it's time to explain the crucial folder: train_files.

train_files might be the most important folder in this dataset, since we will train our models on these files.

The files are laid out the same way as in the previous folder, so we are not going to delve deeper.

See you on the LB, good luck!