# About this Notebook
This notebook collects all relevant information about our training data in table form. A first crawler method iterates through all directories, reads the desired information, and stores it in a pandas dataframe. A second crawler method then iterates through every entry of that dataframe and extracts the information about every line of text in the file. This information is then added to the dataframe.

The approach uses the DataExtractor class and consists of the following steps:

0. *de = DataExtractor():* Instantiate the DataExtractor class, which is used to extract the data from the files.
1. *de.extract_data():* Extract data about each image from json files and store in dataframe.
2. *de.extract_lines_data():* Extract data about each line of text from each dataframe row and store in second dataframe.
3. *de.generate_line_images():* Crop each row of text from original images, store them in new folder in existing structure, add paths to second dataframe.
4. *de.get_line_image_paths():* Optional alternative to step 3: if the images have already been cropped, the cropping can be skipped and the paths of the cropped images can be loaded directly.
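
The first two steps can be illustrated with a minimal sketch. This is not the DataExtractor implementation: the function names `crawl_metadata` and `explode_lines`, the assumption that each image sits next to its JSON file with a `.png` suffix, and the tiny sample record are all hypothetical and only show the general idea of crawling a directory tree into one dataframe and then expanding the per-image line lists into a second one.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def crawl_metadata(top_dir: Path) -> pd.DataFrame:
    """Collect one row per JSON metadata file found below top_dir."""
    records = []
    # sorted() keeps the row order deterministic across runs
    for json_path in sorted(top_dir.rglob("*.json")):
        with json_path.open(encoding="utf-8") as fh:
            meta = json.load(fh)
        # Hypothetical convention: the image sits next to its JSON file
        meta["img_path"] = json_path.with_suffix(".png").as_posix()
        records.append(meta)
    df = pd.DataFrame(records)
    df.index = [f"img{i}" for i in range(1, len(df) + 1)]
    df.index.name = "id"
    return df


def explode_lines(df: pd.DataFrame) -> pd.DataFrame:
    """Turn the per-image 'lines_data' lists into one row per text line."""
    rows = []
    for img_id, row in df.iterrows():
        for line in row["lines_data"]:
            # Copy image-level fields onto every line of that image
            rows.append({"img_id": img_id, "font": row["font"], **line})
    df_lines = pd.DataFrame(rows)
    df_lines.index = [f"line{i}" for i in range(1, len(df_lines) + 1)]
    df_lines.index.name = "id"
    return df_lines


# Tiny self-made example instead of the real data set
with tempfile.TemporaryDirectory() as tmp:
    sample_dir = Path(tmp) / "CoffeeScript" / "some~repo"
    sample_dir.mkdir(parents=True)
    sample = {
        "font": "Andale Mono",
        "lines_data": [
            {"line_number": 1, "x": 426, "y": 57, "text": "a = 1"},
            {"line_number": 2, "x": 426, "y": 79, "text": "b = 2"},
        ],
    }
    (sample_dir / "file.json").write_text(json.dumps(sample), encoding="utf-8")
    df_demo = crawl_metadata(Path(tmp))
    df_lines_demo = explode_lines(df_demo)
```

Keeping the image id on every line row makes it possible to join the two dataframes again later.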

To save or load the dataframes, the following methods can be used:

* *de.to_json():* Save one or both dataframes as json files.
* *de.from_json():* Load one or both dataframes from json files.
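
How `de.to_json()` and `de.from_json()` work internally is not shown in this notebook. As a rough sketch of what such a round trip can look like with plain pandas, the following uses `DataFrame.to_json` and `pandas.read_json`; the sample dataframe, the temporary directory, and the choice of `orient="index"` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the image dataframe, indexed by image id
df_demo = pd.DataFrame(
    {"font": ["Andale Mono", "Victor Mono"], "img_width": [1657, 1198]},
    index=["img1", "img2"],
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data.json"
    # orient="index" writes {row_id: {column: value, ...}, ...}
    df_demo.to_json(out_path, orient="index")
    # Reading with the same orient restores the row ids as index
    df_loaded = pd.read_json(out_path, orient="index")
```

Using the same `orient` for writing and reading is what keeps the row ids as the index after loading.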

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
import os
import json
import cv2

# If the setup of the src module doesn't work, add the src directory to sys.path instead. In that case, remove the "src." prefix from the import statement below.
# import sys
# module_path = os.path.abspath(os.path.join('..'))
# if module_path not in sys.path:
#     sys.path.append(Path(module_path).joinpath("src").as_posix())

from src.data_extraction import DataExtractor

In [3]:
de = DataExtractor(top_dir_path="../data/raw")
df = de.extract_data()
df

id,img_path,img_width,img_height,bbox,char_width,char_height,ln_start,ln_end,lines_data,font,theme,timestamp,language,repository,file
img1,../data/raw/CoffeeScript/abe33~atom-color-high...,1657,818,"[426, 57, 614, 616]",10.072727,22.0,1,28.0,"[{'x': 426, 'y': 57, 'line_number': 1, 'height...",Andale Mono,Learn with Sumit Theme,"2022/08/25, 15:52:20",CoffeeScript,abe33~atom-color-highlight,lib~atom-color-highlight-element.coffee
img2,../data/raw/CoffeeScript/abe33~atom-color-high...,1438,818,"[78, 57, 524, 616]",10.068966,22.0,1,28.0,"[{'x': 78, 'y': 57, 'line_number': 1, 'height'...",Andale Mono,Learn with Sumit Theme,"2022/08/25, 15:51:59",CoffeeScript,abe33~atom-color-highlight,lib~atom-color-highlight-model.coffee
img3,../data/raw/CoffeeScript/abe33~atom-color-high...,1438,818,"[78, 57, 635, 594]",10.071429,22.0,1,27.0,"[{'x': 78, 'y': 57, 'line_number': 1, 'height'...",Andale Mono,Learn with Sumit Theme,"2022/08/25, 15:51:48",CoffeeScript,abe33~atom-color-highlight,lib~dot-marker-element.coffee
img4,../data/raw/CoffeeScript/abe33~atom-color-high...,1438,818,"[78, 57, 473, 616]",10.069366,22.0,1,28.0,"[{'x': 78, 'y': 57, 'line_number': 1, 'height'...",Andale Mono,Learn with Sumit Theme,"2022/08/25, 15:51:38",CoffeeScript,abe33~atom-color-highlight,lib~marker-element.coffee
img5,../data/raw/CoffeeScript/abe33~atom-color-high...,1438,818,"[78, 57, 1038, 616]",10.072464,22.0,2,28.0,"[{'x': 78, 'y': 57, 'line_number': 1, 'height'...",Andale Mono,Learn with Sumit Theme,"2022/08/25, 15:52:08",CoffeeScript,abe33~atom-color-highlight,spec~atom-color-highlight-spec.coffee
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
img6231,../data/raw/TypeScript/yiminghe~async-validato...,1198,1839,"[75, 57, 687, 1760]",9.154509,22.0,1,80.0,"[{'x': 75, 'y': 57, 'line_number': 1, 'height'...",Victor Mono,Shades of Purple,"2022/08/28, 04:19:52",TypeScript,yiminghe~async-validator,src~index.ts
img6232,../data/raw/TypeScript/yiminghe~async-validato...,1198,1839,"[75, 57, 876, 1760]",9.152558,22.0,1,80.0,"[{'x': 75, 'y': 57, 'line_number': 1, 'height'...",Victor Mono,Shades of Purple,"2022/08/28, 04:20:04",TypeScript,yiminghe~async-validator,src~interface.ts
img6233,../data/raw/TypeScript/yiminghe~async-validato...,1198,1838,"[75, 57, 723, 1100]",9.156047,22.0,1,50.0,"[{'x': 75, 'y': 57, 'line_number': 1, 'height'...",Victor Mono,Shades of Purple,"2022/08/28, 04:20:41",TypeScript,yiminghe~async-validator,src~rule~range.ts
img6234,../data/raw/TypeScript/yiminghe~async-validato...,1198,1838,"[75, 57, 876, 1584]",9.155556,22.0,1,72.0,"[{'x': 75, 'y': 57, 'line_number': 1, 'height'...",Victor Mono,Shades of Purple,"2022/08/28, 04:20:27",TypeScript,yiminghe~async-validator,src~rule~url.ts


In [4]:
df_lines = de.extract_lines_data()
de.to_json(attr="all", file_name=["data.json", "lines_data_no_path.json"], to_dir="../data/extracted")
df_lines

Removed negatives in df_lines. Total: 346


id,img_id,img_path,font,theme,language,repository,file,line_number,x,y,height,width,character_width,code_width,text
line1,img1,../data/raw/CoffeeScript/abe33~atom-color-high...,Andale Mono,Learn with Sumit Theme,CoffeeScript,abe33~atom-color-highlight,lib~atom-color-highlight-element.coffee,1.0,426,57,22.0,915.0,10.068966,292.0,_ = require 'underscore-plus'
line2,img1,../data/raw/CoffeeScript/abe33~atom-color-high...,Andale Mono,Learn with Sumit Theme,CoffeeScript,abe33~atom-color-highlight,lib~atom-color-highlight-element.coffee,2.0,426,79,22.0,915.0,10.072727,554.0,"{CompositeDisposable, Disposable} = require 'e..."
line3,img1,../data/raw/CoffeeScript/abe33~atom-color-high...,Andale Mono,Learn with Sumit Theme,CoffeeScript,abe33~atom-color-highlight,lib~atom-color-highlight-element.coffee,3.0,426,101,,,,,
line4,img1,../data/raw/CoffeeScript/abe33~atom-color-high...,Andale Mono,Learn with Sumit Theme,CoffeeScript,abe33~atom-color-highlight,lib~atom-color-highlight-element.coffee,4.0,426,123,22.0,915.0,10.071429,423.0,MarkerElement = require './marker-element'
line5,img1,../data/raw/CoffeeScript/abe33~atom-color-high...,Andale Mono,Learn with Sumit Theme,CoffeeScript,abe33~atom-color-highlight,lib~atom-color-highlight-element.coffee,5.0,426,145,22.0,915.0,10.081633,494.0,DotMarkerElement = require './dot-marker-element'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
line293663,img6235,../data/raw/TypeScript/yiminghe~async-validato...,Victor Mono,Shades of Purple,TypeScript,yiminghe~async-validator,__tests__~validator.spec.ts,76.0,75,1707,22.0,876.0,9.111111,82.0,{
line293664,img6235,../data/raw/TypeScript/yiminghe~async-validato...,Victor Mono,Shades of Purple,TypeScript,yiminghe~async-validator,__tests__~validator.spec.ts,77.0,75,1729,22.0,876.0,9.159091,403.0,"validator(rule, value, callback) {"
line293665,img6235,../data/raw/TypeScript/yiminghe~async-validato...,Victor Mono,Shades of Purple,TypeScript,yiminghe~async-validator,__tests__~validator.spec.ts,78.0,75,1751,22.0,876.0,9.157895,348.0,callback(new Error('e1'));
line293666,img6235,../data/raw/TypeScript/yiminghe~async-validato...,Victor Mono,Shades of Purple,TypeScript,yiminghe~async-validator,__tests__~validator.spec.ts,79.0,75,1773,22.0,876.0,9.166667,110.0,"},"


In [13]:
# Generate images for each line of code in the dataset, crop images using full width of each line of code, save resulting dataframe in json file
#df_lines_fw = de.generate_line_images(save_dir="line_images_fw", use_code_width=False)
#de.to_json(attr="all", file_name=["data.json", "lines_data_fw.json"], to_dir="../data/extracted")

# Generate images for each line of code in the dataset, crop images using only the code width of each line of code, save resulting dataframe in json file
df_lines_cw = de.generate_line_images(save_dir="line_images_cw", use_code_width=True)
de.to_json(attr="lines_data", file_name="lines_data_cw.json", to_dir="../data/extracted")
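
The difference between the two cropping variants can be sketched with plain NumPy slicing, since images loaded with `cv2.imread` are NumPy arrays. The helper `crop_line` is hypothetical and not part of DataExtractor, and the mapping of `use_code_width=False` to the `width` column and `use_code_width=True` to the `code_width` column is an assumption based on the column names in `df_lines`.

```python
import numpy as np


def crop_line(image, x, y, width, height):
    """Return the sub-image covering one line of text.

    Images are indexed as image[row, column], i.e. image[y, x].
    """
    img_h, img_w = image.shape[:2]
    # Clamp the box to the image bounds so slicing never wraps around
    x0, y0 = max(int(x), 0), max(int(y), 0)
    x1, y1 = min(int(x + width), img_w), min(int(y + height), img_h)
    return image[y0:y1, x0:x1]


# Dummy 818 x 1657 BGR image standing in for a screenshot loaded with cv2.imread
image = np.zeros((818, 1657, 3), dtype=np.uint8)

# Values taken from row line1 of df_lines
full_width_crop = crop_line(image, x=426, y=57, width=915.0, height=22.0)
code_width_crop = crop_line(image, x=426, y=57, width=292.0, height=22.0)
```

Rows whose `height` and `width` are missing, such as line3, cannot be cropped this way and would have to be skipped first.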

In [None]:
# Optional: load the line image paths of each code line without cropping the images again
# df_lines = de.get_line_image_paths(save_dir="line_images_fw")

In [14]:
# Load data & lines_data from json files
df_data, df_lines_cw = de.from_json(attr="all", file_name=["data.json", "lines_data_cw.json"], from_dir="../data/extracted")

In [11]:
df_lines_cw.iloc[1, 2]

'Andale Mono'

In [17]:
df_lines_cw.head(5)

,img_id,img_path,line_img_path,font,theme,language,repository,file,line_number,x,y,height,width,character_width,code_width,text
line1,img1,../data/raw/CoffeeScript/abe33~atom-color-high...,../data/raw/CoffeeScript/abe33~atom-color-high...,Andale Mono,Learn with Sumit Theme,CoffeeScript,abe33~atom-color-highlight,lib~atom-color-highlight-element.coffee,1.0,426,57,22.0,915.0,10.068966,292.0,_ = require 'underscore-plus'
line2,img1,../data/raw/CoffeeScript/abe33~atom-color-high...,../data/raw/CoffeeScript/abe33~atom-color-high...,Andale Mono,Learn with Sumit Theme,CoffeeScript,abe33~atom-color-highlight,lib~atom-color-highlight-element.coffee,2.0,426,79,22.0,915.0,10.072727,554.0,"{CompositeDisposable, Disposable} = require 'e..."
line3,img1,../data/raw/CoffeeScript/abe33~atom-color-high...,,Andale Mono,Learn with Sumit Theme,CoffeeScript,abe33~atom-color-highlight,lib~atom-color-highlight-element.coffee,3.0,426,101,,,,,
line4,img1,../data/raw/CoffeeScript/abe33~atom-color-high...,../data/raw/CoffeeScript/abe33~atom-color-high...,Andale Mono,Learn with Sumit Theme,CoffeeScript,abe33~atom-color-highlight,lib~atom-color-highlight-element.coffee,4.0,426,123,22.0,915.0,10.071429,423.0,MarkerElement = require './marker-element'
line5,img1,../data/raw/CoffeeScript/abe33~atom-color-high...,../data/raw/CoffeeScript/abe33~atom-color-high...,Andale Mono,Learn with Sumit Theme,CoffeeScript,abe33~atom-color-highlight,lib~atom-color-highlight-element.coffee,5.0,426,145,22.0,915.0,10.081633,494.0,DotMarkerElement = require './dot-marker-element'
