# About this Notebook
This notebook collects all relevant information about our training data in tabular form. A first crawler method iterates over every directory below the top-level data directory, reads the desired metadata, and stores it in a pandas dataframe. A second crawler method then iterates over every row of that dataframe and extracts the information about each line of text in the corresponding file. This per-line information is stored in a second dataframe (see the steps below).
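
The first crawling step can be pictured roughly as follows. This is only a minimal sketch and not the actual DataExtractor implementation: the function name `crawl_metadata`, the file name `meta.json`, and the column `source_dir` are assumptions made for illustration, and the directory tree is a throwaway one created on the fly.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def crawl_metadata(top_dir):
    """Collect one record per JSON file found anywhere below top_dir."""
    records = []
    for json_path in sorted(Path(top_dir).rglob("*.json")):
        with open(json_path, encoding="utf-8") as f:
            record = json.load(f)
        # Remember where the record came from (hypothetical column name).
        record["source_dir"] = json_path.parent.as_posix()
        records.append(record)
    return pd.DataFrame.from_records(records)


# Build a tiny directory tree so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    for language, font in [("PHP", "Hack"), ("Shell", "ProggyCleanTT")]:
        sample_dir = Path(tmp) / language / "some~repo" / "some~file"
        sample_dir.mkdir(parents=True)
        metadata = {"language": language, "font": font}
        (sample_dir / "meta.json").write_text(json.dumps(metadata), encoding="utf-8")
    df_demo = crawl_metadata(tmp)

print(df_demo[["language", "font"]])
```

Each JSON file becomes one row, which is the same shape as the image-level dataframe shown further down.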

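*Black-and-white version*
This notebook works on the black-and-white images. For this, `file_path` in `generate_line_images()` in `data_extraction.py` must be set to `file_path = Path(dir_path).joinpath("line" + str(row["line_number"]) + "_bw.png")`. To go back to the color version, delete `_bw`.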
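
To make the effect of that one-line change concrete, here is a small sketch of the path construction. The helper `line_image_path` and its `black_white` flag are hypothetical and only wrap the expression quoted above.

```python
from pathlib import Path


def line_image_path(dir_path, line_number, black_white=True):
    """Build the file path of one cropped line image (hypothetical helper)."""
    suffix = "_bw.png" if black_white else ".png"
    return Path(dir_path).joinpath("line" + str(line_number) + suffix)


print(line_image_path("line_images_cw", 151.0).as_posix())
# line_images_cw/line151.0_bw.png
print(line_image_path("line_images_cw", 151.0, black_white=False).as_posix())
# line_images_cw/line151.0.png
```

Because `line_number` is stored as a float, `str()` keeps the trailing `.0`, which is why the file names that appear later in this notebook look like `line151.0_bw.png`.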

The approach uses the DataExtractor class and consists of the following steps:

0. *de = DataExtractor():* Instantiate the DataExtractor class, which is used to extract the data from the files.
1. *de.extract_data():* Extract data about each image from json files and store in dataframe.
2. *de.extract_lines_data():* Extract data about each line of text from each dataframe row and store it in a second dataframe.
3. *de.generate_line_images():* Crop each row of text from original images, store them in new folder in existing structure, add paths to second dataframe.
4. *de.get_line_image_paths():* Optionally, the image cropping can be skipped (if already done) and the paths of the cropped images can be loaded directly from the second dataframe.
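
The cropping in step 3 boils down to slicing a rectangle out of the image array. The sketch below illustrates this with plain NumPy slicing on a synthetic image (images loaded with cv2 are NumPy arrays as well). It is not the code of `generate_line_images()`. The helper `crop_line` is hypothetical, and it is an assumption that `use_code_width` switches between the `width` and `code_width` columns of the second dataframe.

```python
import numpy as np


def crop_line(image, x, y, height, width):
    """Return the sub-image covering one line of text (hypothetical helper)."""
    x, y, height, width = (int(round(v)) for v in (x, y, height, width))
    # Rows are indexed by y, columns by x.
    return image[y:y + height, x:x + width]


# Synthetic grayscale image with the dimensions of img1 (997 rows, 1917 columns).
image = np.zeros((997, 1917), dtype=np.uint8)

# Values of the line1 row in the lines table of this notebook:
# x=382, y=64, height=20.0, width=1415.0, code_width=482.0
full_width_crop = crop_line(image, x=382, y=64, height=20.0, width=1415.0)
code_width_crop = crop_line(image, x=382, y=64, height=20.0, width=482.0)

print(full_width_crop.shape)  # (20, 1415)
print(code_width_crop.shape)  # (20, 482)
```

Cropping to the code width yields much narrower images for short lines, at the cost of line images with varying widths.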

To save or load the dataframes, the following methods can be used:

* *de.to_json():* Save one or both dataframes as json files.
* *de.from_json():* Load one or both dataframes from json files.
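
For orientation, a JSON round trip of a dataframe with plain pandas might look like the sketch below. Whether `to_json()` and `from_json()` of the DataExtractor use the same orientation or options is not known here, so `orient="index"` and the toy data are assumptions.

```python
import io

import pandas as pd

df_demo = pd.DataFrame(
    {"img_id": ["img1", "img1"], "line_number": [150.0, 151.0], "x": [382, 382]},
    index=pd.Index(["line1", "line2"], name="id"),
)

# Serialize as {row label -> {column -> value}} and read it back.
json_text = df_demo.to_json(orient="index")
df_loaded = pd.read_json(io.StringIO(json_text), orient="index")

print(df_loaded)
print(df_loaded.index.name)  # None: the index name is not part of this JSON layout
```

With this orientation the row labels survive the round trip, but the name of the index does not.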

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import os
import json
import cv2

# Fallback in case the src package is not set up: add ../src to sys.path so that the class
# can be imported directly. With this approach, the import below must not use the "src." prefix.
import sys
src_path = Path(os.path.abspath("..")).joinpath("src").as_posix()
if src_path not in sys.path:
    sys.path.append(src_path)

from data_extraction import DataExtractor

In [2]:
de = DataExtractor(top_dir_path="../data/raw")
df = de.extract_data()
df

id,img_path,img_width,img_height,bbox,char_width,char_height,ln_start,ln_end,lines_data,font,theme,timestamp,language,repository,file
img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,1917,997,"[382, 84, 751, 320]",9.265409,20.0,150,166.0,"[{'x': 382, 'y': 64, 'line_number': 150, 'heig...",Hack,RailsCasts,"2022/08/27, 02:05:27",PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php
img2,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,1917,997,"[122, 65, 686, 660]",9.272727,20.0,1,33.0,"[{'x': 122, 'y': 65, 'line_number': 1, 'height...",Hack,RailsCasts,"2022/08/27, 02:05:41",PHP,phpDocumentor~TypeResolver,tests~unit~FqsenResolverTest.php
img3,../data/raw/PHP/phpDocumentor~TypeResolver/src...,1917,997,"[382, 65, 686, 660]",9.267857,20.0,1,33.0,"[{'x': 382, 'y': 65, 'line_number': 1, 'height...",Hack,RailsCasts,"2022/08/27, 02:05:55",PHP,phpDocumentor~TypeResolver,src~TypeResolver.php
img4,../data/raw/PHP/phpDocumentor~TypeResolver/src...,1917,997,"[382, 65, 1093, 660]",9.269231,20.0,1,33.0,"[{'x': 382, 'y': 65, 'line_number': 1, 'height...",Hack,RailsCasts,"2022/08/27, 02:06:09",PHP,phpDocumentor~TypeResolver,src~Types~Object_.php
img5,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,1917,997,"[382, 65, 686, 660]",9.266714,20.0,1,33.0,"[{'x': 382, 'y': 65, 'line_number': 1, 'height...",Hack,RailsCasts,"2022/08/27, 02:06:23",PHP,phpDocumentor~TypeResolver,tests~unit~CollectionResolverTest.php
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
img11213,../data/raw/Shell/webdevops~Dockerfile/docker~...,766,942,"[56, 57, 411, 855]",4.897698,15.0,1,57.0,"[{'x': 56, 'y': 57, 'line_number': 1, 'height'...",ProggyCleanTT,Chameleon,"2022/08/28, 02:18:11",Shell,webdevops~Dockerfile,docker~php-nginx-dev~debian-8~conf~provision~e...
img11214,../data/raw/Shell/webdevops~Dockerfile/docker~...,766,942,"[56, 57, 184, 855]",4.897959,15.0,1,57.0,"[{'x': 56, 'y': 57, 'line_number': 1, 'height'...",ProggyCleanTT,Chameleon,"2022/08/28, 02:18:26",Shell,webdevops~Dockerfile,docker~php-apache-dev~7.1~conf~provision~entry...
img11215,../data/raw/Shell/webdevops~Dockerfile/docker~...,766,942,"[56, 69, 181, 840]",4.903226,15.0,74,104.2,"[{'x': 56, 'y': 54, 'line_number': -0.9, 'heig...",ProggyCleanTT,Chameleon,"2022/08/28, 02:18:37",Shell,webdevops~Dockerfile,docker~php-apache-dev~7.3~conf~provision~entry...
img11216,../data/raw/Shell/webdevops~Dockerfile/docker~...,766,942,"[56, 57, 181, 855]",4.892857,15.0,1,49.1,"[{'x': 56, 'y': 57, 'line_number': 1, 'height'...",ProggyCleanTT,Chameleon,"2022/08/28, 02:18:53",Shell,webdevops~Dockerfile,docker~php-apache-dev~centos-7-php7~conf~provi...


In [3]:
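# Point img_path at the black-and-white image instead of the color image, i.e. "img.png" -> "bw.png".
# regex=False treats the pattern as a literal string, so "." does not match arbitrary characters.
df["img_path"] = df["img_path"].str.replace("img.png", "bw.png", regex=False)

# Store df on the DataExtractor instance
de.data = df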

# Preview
de.data.head(5)



id,img_path,img_width,img_height,bbox,char_width,char_height,ln_start,ln_end,lines_data,font,theme,timestamp,language,repository,file
img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,1917,997,"[382, 84, 751, 320]",9.265409,20.0,150,166.0,"[{'x': 382, 'y': 64, 'line_number': 150, 'heig...",Hack,RailsCasts,"2022/08/27, 02:05:27",PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php
img2,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,1917,997,"[122, 65, 686, 660]",9.272727,20.0,1,33.0,"[{'x': 122, 'y': 65, 'line_number': 1, 'height...",Hack,RailsCasts,"2022/08/27, 02:05:41",PHP,phpDocumentor~TypeResolver,tests~unit~FqsenResolverTest.php
img3,../data/raw/PHP/phpDocumentor~TypeResolver/src...,1917,997,"[382, 65, 686, 660]",9.267857,20.0,1,33.0,"[{'x': 382, 'y': 65, 'line_number': 1, 'height...",Hack,RailsCasts,"2022/08/27, 02:05:55",PHP,phpDocumentor~TypeResolver,src~TypeResolver.php
img4,../data/raw/PHP/phpDocumentor~TypeResolver/src...,1917,997,"[382, 65, 1093, 660]",9.269231,20.0,1,33.0,"[{'x': 382, 'y': 65, 'line_number': 1, 'height...",Hack,RailsCasts,"2022/08/27, 02:06:09",PHP,phpDocumentor~TypeResolver,src~Types~Object_.php
img5,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,1917,997,"[382, 65, 686, 660]",9.266714,20.0,1,33.0,"[{'x': 382, 'y': 65, 'line_number': 1, 'height...",Hack,RailsCasts,"2022/08/27, 02:06:23",PHP,phpDocumentor~TypeResolver,tests~unit~CollectionResolverTest.php
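
One detail of the path replacement in cell 3 is worth spelling out: `Series.str.replace` can interpret its pattern as a regular expression, in which case the `.` in `"img.png"` matches any character. The sketch below uses two made-up paths to show the difference between the regex and the literal interpretation.

```python
import pandas as pd

# Made-up paths; the second one contains a directory that only a regex would match.
paths = pd.Series(["repo/file.php/img.png", "repo/imgXpng/img.png"])

as_regex = paths.str.replace("img.png", "bw.png", regex=True)
as_literal = paths.str.replace("img.png", "bw.png", regex=False)

print(as_regex.tolist())    # ['repo/file.php/bw.png', 'repo/bw.png/bw.png']
print(as_literal.tolist())  # ['repo/file.php/bw.png', 'repo/imgXpng/bw.png']
```

For the paths in this dataset both variants most likely give the same result, but passing `regex=False` states the intent explicitly.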


In [4]:
# Check that img_path now points to the black-and-white image
de.data.iloc[125,0]

'../data/raw/PHP/timber~starter-theme/single.php/bw.png'

In [5]:
df_lines = de.extract_lines_data()
de.to_json(attr="all", file_name=["data_bw.json", "lines_data_no_path_bw.json"], to_dir="../data/extracted")
df_lines

Removed negatives x-values in df_lines. Total: 561


id,img_id,img_path,font,theme,language,repository,file,line_number,x,y,height,width,character_width,code_width,text
line1,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,150.0,382,64,20.0,1415.0,9.269231,482.0,* @uses \phpDocumentor\Reflection\Types\C...
line2,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,151.0,382,84,20.0,1415.0,9.264151,491.0,* @uses \phpDocumentor\Reflection\Types\C...
line3,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,152.0,382,104,20.0,1415.0,9.272727,510.0,* @uses \phpDocumentor\Reflection\Types\C...
line4,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,153.0,382,124,20.0,1415.0,9.269231,482.0,* @uses \phpDocumentor\Reflection\Types\S...
line5,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,154.0,382,144,20.0,1415.0,9.333333,56.0,*
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
line534492,img11217,../data/raw/Shell/webdevops~Dockerfile/docker~...,ProggyCleanTT,Chameleon,Shell,webdevops~Dockerfile,docker~php-nginx-dev~ubuntu-15.04~conf~provisi...,31.0,74,777,24.0,713.0,7.962963,181.0,"/usr/local/etc/php/conf.d/"""
line534493,img11217,../data/raw/Shell/webdevops~Dockerfile/docker~...,ProggyCleanTT,Chameleon,Shell,webdevops~Dockerfile,docker~php-nginx-dev~ubuntu-15.04~conf~provisi...,32.0,74,801,,,,,
line534494,img11217,../data/raw/Shell/webdevops~Dockerfile/docker~...,ProggyCleanTT,Chameleon,Shell,webdevops~Dockerfile,docker~php-nginx-dev~ubuntu-15.04~conf~provisi...,33.0,74,825,24.0,713.0,7.964286,181.0,function phpModuleRemove() {
line534495,img11217,../data/raw/Shell/webdevops~Dockerfile/docker~...,ProggyCleanTT,Chameleon,Shell,webdevops~Dockerfile,docker~php-nginx-dev~ubuntu-15.04~conf~provisi...,34.0,74,849,24.0,713.0,7.962963,181.0,"if [ ""$#"" -ne 1 ]; then"
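
The message printed by cell 5 indicates that rows with negative x values are dropped from the lines dataframe. How `extract_lines_data()` does this internally is not shown in this notebook. A minimal sketch of such a filter with a boolean mask, on made-up data, could look like this:

```python
import pandas as pd

# Made-up rows; only the x column matters for the filter.
df_demo = pd.DataFrame(
    {"x": [382, -12, 74, -3], "text": ["a", "b", "c", "d"]},
    index=["line1", "line2", "line3", "line4"],
)

negative = df_demo["x"] < 0
removed_total = int(negative.sum())
df_clean = df_demo.loc[~negative]

print(f"Removed negative x-values. Total: {removed_total}")
print(df_clean)
```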


In [6]:
# Check the image path of an arbitrary row in the lines dataframe
df_lines.iloc[301, 1]

'../data/raw/PHP/hwi~HWIOAuthBundle/src~OAuth~ResourceOwner~EveOnlineResourceOwner.php/bw.png'

In [7]:
de.lines_data.head(5)

id,img_id,img_path,font,theme,language,repository,file,line_number,x,y,height,width,character_width,code_width,text
line1,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,150.0,382,64,20.0,1415.0,9.269231,482.0,* @uses \phpDocumentor\Reflection\Types\C...
line2,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,151.0,382,84,20.0,1415.0,9.264151,491.0,* @uses \phpDocumentor\Reflection\Types\C...
line3,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,152.0,382,104,20.0,1415.0,9.272727,510.0,* @uses \phpDocumentor\Reflection\Types\C...
line4,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,153.0,382,124,20.0,1415.0,9.269231,482.0,* @uses \phpDocumentor\Reflection\Types\S...
line5,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,154.0,382,144,20.0,1415.0,9.333333,56.0,*


In [8]:
de.lines_data.head(5).iloc[0, 1]

'../data/raw/PHP/phpDocumentor~TypeResolver/tests~unit~IntegerRangeResolverTest.php/bw.png'

In [9]:
# Variant 1 (disabled): crop one image per line of code using the full line width, then save both dataframes as JSON files
#df_lines_fw = de.generate_line_images(save_dir="line_images_fw", use_code_width=False)
#de.to_json(attr="all", file_name=["data.json", "lines_data_fw.json"], to_dir="../data/extracted")

# Variant 2: crop one image per line of code using only the width of the code on that line, then save the lines dataframe as a JSON file
df_lines_cw_bw = de.generate_line_images(save_dir="line_images_cw", use_code_width=True)
de.to_json(attr="lines_data", file_name="lines_data_cw_bw.json", to_dir="../data/extracted")

In [10]:
# Optional: load the paths of the line images without cropping the images again
# df_lines = de.get_line_image_paths(save_dir="line_images_fw")

In [11]:
# Load data & lines_data from json files
df_data, df_lines_cw_bw = de.from_json(attr="all", file_name=["data_bw.json", "lines_data_cw_bw.json"], from_dir="../data/extracted")


In [12]:
df_lines_cw_bw.iloc[1, 2]

'../data/raw/PHP/phpDocumentor~TypeResolver/tests~unit~IntegerRangeResolverTest.php/line_images_cw/line151.0_bw.png'

In [13]:
df_lines_cw_bw.head(5)

Unnamed: 0,img_id,img_path,line_img_path,font,theme,language,repository,file,line_number,x,y,height,width,character_width,code_width,text
line1,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,150.0,382,64,20.0,1415.0,9.269231,482.0,* @uses \phpDocumentor\Reflection\Types\C...
line2,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,151.0,382,84,20.0,1415.0,9.264151,491.0,* @uses \phpDocumentor\Reflection\Types\C...
line3,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,152.0,382,104,20.0,1415.0,9.272727,510.0,* @uses \phpDocumentor\Reflection\Types\C...
line4,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,153.0,382,124,20.0,1415.0,9.269231,482.0,* @uses \phpDocumentor\Reflection\Types\S...
line5,img1,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,../data/raw/PHP/phpDocumentor~TypeResolver/tes...,Hack,RailsCasts,PHP,phpDocumentor~TypeResolver,tests~unit~IntegerRangeResolverTest.php,154.0,382,144,20.0,1415.0,9.333333,56.0,*


In [14]:
df_lines_cw_bw.head(5).iloc[2,2]

'../data/raw/PHP/phpDocumentor~TypeResolver/tests~unit~IntegerRangeResolverTest.php/line_images_cw/line152.0_bw.png'
