# Introduction

In this article, we present how to handle semi-structured documents using TF-IDF, a Natural Language Processing technique, to create a table containing all the information we want to analyze.

The challenge with semi-structured data is that we don't know the data structure in advance: labels can change over time and depend on the product being described.

Think about an invoice, a technical specification or a bill of materials: although we know we will find a series of labels (or descriptors) and their respective values, we don't know in advance which labels will appear across all the documents we might receive, since each one describes a different product (or item).

To make things worse, this is inherent to any semi-structured data, because we don't know what the future will bring: why would the technical specification of an old car state whether it has a USB connector, when there was obviously no USB in cars at the time?

Likewise, why would a bill of materials specify the colour of a fabric if it describes the materials needed to make a table (which uses no fabric)?

Still, we may need to find all the fabric consumed by a furniture factory that produces not only tables but sofas too!

This makes the traditional approach of defining "labels" up front and extracting their corresponding values useless, or at least not scalable in the long run.

Still, let's accept the challenge and see how we can solve it: TF-IDF, here we go!


# General Strategy

Here we will extract information from the technical specifications of cars. To build one big table containing all the features a car might have, we will go through the specifications in order to:
1/ Find which features should become the table columns, and
2/ Fill the columns with the specifications of each car.

For this analysis we will use the "Term Frequency - Inverse Document Frequency" (TF-IDF) technique to weigh each word of a car's specification. This will let us define a threshold and easily separate "labels" from "values": label words get lower weights and value words get, well, higher weights.

Although TF-IDF has many implications and a complex definition that we will not cover in this article (but it can be seen here), its concept is relatively easy to grasp: it gives lower weights to common words (in our case, the words found in labels) and higher weights to rare words (in our case, the values found in the technical specifications). That is just what we need to extract the label-value pairs from our car technical specifications!
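
To make the idea concrete, here is a minimal hand-rolled sketch in pure Python. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each line, which is the default behaviour documented for scikit-learn's `TfidfVectorizer`. The three toy lines and the helper name `tfidf_weights` are made up for illustration and are not part of the data set.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each line of a datasheet is one tiny "document".
lines = [
    "year 2003",
    "year 2007",
    "year 2015",
]

def tfidf_weights(docs):
    """Return one {word: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: in how many lines does each word appear?
    df = Counter(word for words in tokenised for word in set(words))
    idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}

    weights = []
    for words in tokenised:
        tf = Counter(words)
        raw = {word: tf[word] * idf[word] for word in tf}
        norm = math.sqrt(sum(value ** 2 for value in raw.values()))
        weights.append({word: value / norm for word, value in raw.items()})
    return weights

weights = tfidf_weights(lines)
for doc, weight in zip(lines, weights):
    print(doc, {word: round(value, 3) for word, value in weight.items()})
```

In every line the word "year", which appears in all three lines, ends up with a lower weight than the year number, which appears only once: the label is "common" and the value is "rare".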

As a final step, we will clean the labels found in the first round to make sure they belong to common language: we don't want labels to include jargon, so that the information is understandable by all stakeholders.

# Technical set-up

To handle our challenge, we will use a fairly large set of Python libraries: not only our friends Pandas and NumPy but also re (for regular expressions), NLTK's stopwords and a few others that will help along the journey. Here is our initial set-up.

In [3]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import string
import re
import os
from nltk.corpus import stopwords
import math
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter


# First Round: estimating label candidates

Here is an excerpt of our data (translated into English for the sake of clarity). The data set is composed of more than 20 thousand files of car technical specifications (!). Each line presents a feature and its respective value. Keep in mind that not all values are numbers (e.g. car origin or fuel), nor do all labels come before their values (check the car cylinders, for instance):

<code>

<br/> 
Year 2003 <br/> 
Price R$ 30,303<br/> 
Combustion Propulsion<br/> 
Gasoline Fuel<br/> 
National Origin<br/> 
Warranty 1 year <br/> 
Jeep Configuration<br/> 
Compact Size<br/> 
Places 5<br/> 
4 doors<br/> 
Motor<br/> 
Front Installation<br/> 
Longitudinal Layout<br/> 
Vacuuming Natural<br/> 
Power Multipoint injection<br/> 
3 cylinders in line<br/> 
Single valve control in the cylinder head<br/> 
Valves per cylinder 4<br/> 
Cylinder diameter 81.5 mm<br/> 
Piston stroke 95.8 mm<br/> 
Compression ratio 9.5:1<br/> 
Equipment<br/> 
Standard equipment ABS brakes<br/> 
Standard equipment Front airbags<br/> 
Standard equipment Side airbags<br/> 
Standard equipment Perimeter anti-theft alarm<br/> 
Standard equipment Rear camera for maneuvering<br/> 
Standard equipment 3-point seat belts for all occupants<br/> 
Standard equipment Headrest for all occupants<br/> 
Radio series equipment<br/> 
CD player standard equipment<br/> 
Standard equipment Front airbags<br/> 
Optional equipment Perimeter anti-theft alarm<br/> 
Optional equipment Volumetric anti-theft alarm<br/> 
Standard equipment Headrest for all occupants<br/> 
Optional equipment Fog lights<br/> 
Standard equipment Rear window wiper and washer<br/> 
Optional equipment Rear fog light<br/> 
Optional equipment Side turn signal repeaters<br/> 
Optional equipment Central door locking
</code>


Ok, let's go ahead and extract some sample files

In [4]:
# Set the file path
file_path = "path/to/the/folder/"

files = os.listdir(file_path)
random.shuffle(files) # shuffle the files to get a random sample
text_files = files[1000:1030]  # get a sample of 30 files


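With the placeholder path above, this cell fails with `FileNotFoundError: [Errno 2] No such file or directory: 'path/to/the/folder/'`, so point `file_path` at your own folder. For reference, here are the sample and the folder used in the rest of this article: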

In [9]:
text_files  = ['22397.txt',
 '16273.txt',
 '13249.txt',
 '21550.txt',
 '22013.txt',
 '23487.txt',
 '14521.txt',
 '14910.txt',
 '934.txt',
 '14426.txt',
 '22821.txt',
 '9066.txt',
 '2087.txt',
 '5011.txt',
 '1667.txt',
 '6068.txt',
 '3122.txt',
 '19168.txt',
 '21872.txt',
 '11505.txt',
 '14847.txt',
 '12923.txt',
 '22962.txt',
 '15261.txt',
 '1185.txt',
 '3034.txt',
 '2039.txt',
 '4564.txt',
 '5563.txt',
 '15147.txt']

file_path = '/home/marcnaweb/code/marcnaweb/car_recommendation_engine/raw_data/text_files/extracted_datasheets/'

Since each line of our content holds a label and its value, let's open our files and build our corpus on a per-line basis.

In [10]:
# build the stopword set once: the lookup runs for every single word
stop_words_pt = set(stopwords.words('portuguese'))
allowed_punct = set('/.,')  # the only punctuation marks a word may contain

text_content = []

for file_name in text_files:
    with open(f"{file_path}{file_name}", 'r') as f:
        # split the content by lines --> each line is a (very small) document
        lines = f.read().split('\n')

    for line in lines:
        # remove anything between square brackets
        line = re.sub(r'\[.*?\]', ' ', line)

        # drop words containing punctuation other than '/', '.' or ','
        words = [word for word in line.split()
                 if all(c in allowed_punct for c in word if c in string.punctuation)]

        # add a position indicator to the words that are not stopwords
        # from left to right
        words = [f"{i}_{word}" if word not in stop_words_pt else word for i, word in enumerate(words)]
        # from right to left
        words.reverse()
        words = [f"{i}_{word}" if word not in stop_words_pt else word for i, word in enumerate(words)]
        # put the words back in reading order
        words.reverse()

        text_content.append(' '.join(words))


Yes, you spotted it: we are not only cleaning our data a bit but also enriching our words with a position indicator, counted both from left to right and from right to left. Why?
Remember that labels (and values) might be at the beginning or at the end of each line. In "Year 2003" and "Warranty 1 year", the word "year" has a different role depending on its position in the sentence. If we don't include this indicator, our TF-IDF matrix will treat both occurrences as the same word and may completely ruin our results: after all, in "Year 2003" the _label_ is year, while in "Warranty 1 year" the _value_ is year, isn't it?
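
As a quick illustration of the double position indicator, here is a small stand-alone sketch of the tagging step applied to those two lines. The helper name `tag_positions` is hypothetical, and the stopword set is left empty here so that the example runs without NLTK.

```python
def tag_positions(line, stop_words=frozenset()):
    """Prefix every non-stopword with its position counted from both ends."""
    words = line.split()
    # left-to-right index
    words = [f"{i}_{w}" if w not in stop_words else w for i, w in enumerate(words)]
    # right-to-left index: reverse, tag again, then restore the reading order
    words.reverse()
    words = [f"{i}_{w}" if w not in stop_words else w for i, w in enumerate(words)]
    words.reverse()
    return " ".join(words)

print(tag_positions("Year 2003"))        # 1_0_Year 0_1_2003
print(tag_positions("Warranty 1 year"))  # 2_0_Warranty 1_1_1 0_2_year
```

Once lowercased, "1_0_year" (first word of a two-word line) and "0_2_year" (last word of a three-word line) are two different tokens, so they can receive two different weights.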

It's show time: get ready for your engine to run (and your computer to heat up! ;)
Below we build our matrix of weighted words, with one row per line of text. As we will see, it is essential for the rest of our analysis.

In [11]:

corpus = text_content
# a token must contain at least one "_", "," or ".": position-tagged words and numbers
vectorizer = TfidfVectorizer(min_df=1, stop_words=stopwords.words('portuguese'),
                             token_pattern=r'\b[\w_,.]*[_,.][\w_,.]*\b')

# one row per line of text, one column per weighted word
weighted_words = pd.DataFrame(vectorizer.fit_transform(corpus).toarray(),
                 columns = vectorizer.get_feature_names_out(), index = corpus)

# move the line text to an "index" column, then drop the empty lines
weighted_words.reset_index(inplace=True)
weighted_words["temp"] = weighted_words["index"].apply(lambda x: len(x.split()))
weighted_words = weighted_words[weighted_words["temp"] > 0].drop(columns=["temp"])
# renumber the remaining lines from 0
weighted_words.reset_index(inplace=True)
weighted_words.drop(columns=["level_0"], inplace=True)


Now let's stop for a while and see what we have here, how are our weighted words behaving in the different lines?

In [12]:
line_number = 28917
line = weighted_words[weighted_words.index == line_number ]['index'].values[0]
line_weight = weighted_words[weighted_words.index == line_number].select_dtypes(include=[np.number]).T.sort_values(line_number, ascending=False)

print(line)
line_weight[line_weight[line_number] > 0]

3_0_Velocidade 2_1_máxima 1_2_158 0_3_km/h


Unnamed: 0,28917
1_2_158,0.612644
0_3_km,0.472319
3_0_velocidade,0.472319
2_1_máxima,0.422489


So let's stop and see what we have.
- the original input was: "Velocidade máxima 158 km/h" (meaning: Maximum speed 158 km/h)
- after the initial transformation we have: "3_0_Velocidade 2_1_máxima 1_2_158 0_3_km/h"

And the weighted words properly show a higher weight for "158" than for "km", "velocidade" and "máxima". Yay, we have our value: just take the highest weight, right?

Not so fast: what if we have a composed word as a value?

In [13]:
line_number = 28489
line = weighted_words[weighted_words.index == line_number ]['index'].values[0]
line_weight = weighted_words[weighted_words.index == line_number].select_dtypes(include=[np.number]).T.sort_values(line_number, ascending=False)

print(line)
line_weight[line_weight[line_number] > 0]

5_0_Equipamento de 3_2_série 2_3_Banco 1_4_traseiro 0_5_rebatível


Unnamed: 0,28489
0_5_rebatível,0.504686
2_3_banco,0.480824
1_4_traseiro,0.480824
3_2_série,0.379142
5_0_equipamento,0.373042


Here, we have:
- the original input: "Equipamento de série Banco traseiro rebatível" (meaning: "Standard equipment Folding rear seat")
- after the initial transformation: "5_0_Equipamento de 3_2_série 2_3_Banco 1_4_traseiro 0_5_rebatível"
- and the highest weight goes to "rebatível", meaning "folding"

But what about "banco" (seat) and "traseiro" (rear)? They deserve to be part of the value too!

It's clear that we need a threshold, so let's look further into this case..

In [14]:
line_weight[line_weight[line_number] > 0].describe()

Unnamed: 0,28489
count,5.0
mean,0.443704
std,0.062522
min,0.373042
25%,0.379142
50%,0.480824
75%,0.480824
max,0.504686


Ok, the mean seems a good threshold to filter the words: let's check with other cases.

In [15]:
line_number = 28410
line = weighted_words[weighted_words.index == line_number ]['index'].values[0]
line_weight = weighted_words[weighted_words.index == line_number].select_dtypes(include=[np.number]).T.sort_values(line_number, ascending=False)

print(line)
print(line_weight[line_weight[line_number] > 0])
line_weight[line_weight[line_number] > 0].describe()

4_0_Regime 3_1_potência 2_2_máx. 1_3_6000 0_4_rpm
                 28410
1_3_6000      0.542390
3_1_potência  0.446697
4_0_regime    0.410802
2_2_máx       0.410802
0_4_rpm       0.410802


Unnamed: 0,28410
count,5.0
mean,0.444299
std,0.056995
min,0.410802
25%,0.410802
50%,0.410802
75%,0.446697
max,0.54239


Here we have:
- as the original text input: "Regime potência máx. 6000 rpm" (meaning: engine speed at maximum power, 6,000 rpm)
- after transformation: "4_0_Regime 3_1_potência 2_2_máx. 1_3_6000 0_4_rpm"

The words above the mean are "potência" and "6000": not quite what we are looking for, but closer indeed.

So, after a while of trial and error (and yes, RNNs could help here), I realized that the following calculation makes a good threshold:

threshold = mean + standard deviation / count

For the line above it gives 0.444299 + 0.056995 / 5 ≈ 0.4557, which keeps only "6000".

But no, I don't know why it works either, sorry guys..
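
As a worked example, the sketch below applies this threshold to the weights printed above for the two lines we inspected, using only the standard library. The helper name `select_value_words` is hypothetical, and the weights are copied by hand from the outputs. It needs at least two weights per line, since the sample standard deviation is undefined for a single value.

```python
from statistics import mean, stdev

def select_value_words(weights):
    """Keep the words whose weight exceeds mean + std / count."""
    values = list(weights.values())
    # stdev is the sample standard deviation, like the "std" row of pandas' describe()
    threshold = mean(values) + stdev(values) / len(values)
    return threshold, [word for word, weight in weights.items() if weight > threshold]

# weights copied from the outputs of lines 28410 and 28489
rpm_line = {"1_3_6000": 0.542390, "3_1_potência": 0.446697, "4_0_regime": 0.410802,
            "2_2_máx": 0.410802, "0_4_rpm": 0.410802}
seat_line = {"0_5_rebatível": 0.504686, "2_3_banco": 0.480824, "1_4_traseiro": 0.480824,
             "3_2_série": 0.379142, "5_0_equipamento": 0.373042}

rpm_threshold, rpm_values = select_value_words(rpm_line)
seat_threshold, seat_values = select_value_words(seat_line)
print(round(rpm_threshold, 4), rpm_values)
print(round(seat_threshold, 4), seat_values)
```

With this threshold "potência" drops out of the first line and only "6000" remains, while the second line keeps the three words "rebatível", "banco" and "traseiro".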

Let's implement it programmatically

In [16]:
# describe() gives the mean, std and count of the non-zero weights of the line
stats = line_weight[line_weight[line_number] > 0].describe()
cutting_value = stats.loc['mean'] + stats.loc['std'] / stats.loc['count']
cutting_value = cutting_value.values[0]
cutting_value

0.45569763962547083

And now for all our files.. 

In [17]:

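def get_top_words(line_weight, cutting_value):
    """Return the words of a line whose weight is above the cutting value."""
    # line_weight has a single column (named after the line number),
    # so we no longer depend on the global line_number
    weights = line_weight.iloc[:, 0]
    top_words = weights[weights > cutting_value].index.tolist()
    # line_weight is sorted by descending weight: return the words in ascending order
    top_words.reverse()
    return top_words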


In [18]:
def get_label_and_value(line, value_words):
    """Split a position-tagged line into (label, value); return None if nothing matches."""
    # one alternation per value word, so the match spans from the first value word
    # to the last one; re.escape protects the "." found in tokens such as "1.6"
    alternatives = '|'.join(re.escape(word) for word in value_words)
    # the non-capturing group lets [^\s]* extend every alternative, not only the last one
    pattern = r'(?:{})[^\s]*.*?'.format(alternatives) * len(value_words)

    match = re.search(pattern, line, re.DOTALL)
    if not match:
        return None

    # strip the position indicators (0_, 1_2_, ...) from both the value and the line
    indicator = r'\b\w*_'
    value = re.sub(indicator, '', match.group())
    line = re.sub(indicator, '', line)

    # whatever is left once the value is removed is the label
    label = line.replace(value, '').replace('  ', ' ')
    return label, value
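
To see what this extraction step does on a single line, here is a compact stand-alone sketch of the same idea, applied to the two lines inspected earlier. The function name `split_label_and_value` is hypothetical, and the value words are passed in by hand instead of coming from the TF-IDF threshold.

```python
import re

def split_label_and_value(tagged_line, value_words):
    """Return (label, value) for a position-tagged line, or None if nothing matches."""
    alternatives = "|".join(re.escape(word) for word in value_words)
    # one block per value word: the match spans from the first value word to the last
    pattern = rf"(?:{alternatives})\S*.*?" * len(value_words)
    match = re.search(pattern, tagged_line, re.DOTALL)
    if not match:
        return None

    def strip_tags(text):
        # remove the position indicators such as "3_0_" in front of each word
        return re.sub(r"\b\w*_", "", text)

    value = strip_tags(match.group())
    label = strip_tags(tagged_line).replace(value, "").replace("  ", " ").strip()
    return label, value

speed = split_label_and_value("3_0_velocidade 2_1_máxima 1_2_158 0_3_km/h", ["1_2_158"])
seat = split_label_and_value(
    "5_0_equipamento de 3_2_série 2_3_banco 1_4_traseiro 0_5_rebatível",
    ["0_5_rebatível", "2_3_banco", "1_4_traseiro"],
)
print(speed)  # ('velocidade máxima km/h', '158')
print(seat)   # ('equipamento de série', 'banco traseiro rebatível')
```

Note how the unit "km/h" stays on the label side in the first case: only the words selected by the threshold are treated as the value.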


In [19]:
label_values_list = []

for line_number in weighted_words.index.tolist():
    print(f"line_number: {line_number}")

    line = weighted_words[weighted_words.index == line_number]['index'].values[0]

    if len(line.split()) > 1:
        print(line)

        #get line weight
        line_weight = weighted_words[weighted_words.index == line_number].select_dtypes(include=[np.number]).T.sort_values(line_number, ascending=False)

        #get cutting value =  mean + (standard deviation) divided by (count)
        stats = line_weight[line_weight[line_number] > 0].describe()
        cutting_value = (stats.loc['mean'] + stats.loc['std'] / stats.loc['count']).values[0]

        top_words = get_top_words(line_weight, cutting_value)
        label_values_list.append(get_label_and_value(line.lower(), top_words))


line_number: 0
line_number: 1
line_number: 2
...

# Refining the labels

Let's investigate what we have in our label - value list..

In [20]:
stop_words_pt = set(stopwords.words('portuguese'))

# drop the lines where no label/value pair was found
label_values_list_temp = [pair for pair in label_values_list if pair is not None]

# remove stop words from each label (e.g. "equipamento de série" -> "equipamento série")
label_values_list_temp = [
    (' '.join(word for word in label.split() if word not in stop_words_pt), value)
    for label, value in label_values_list_temp
]

label_values_temp_df = pd.DataFrame(label_values_list_temp, columns=["label", "value"])
label_values_df = label_values_temp_df.copy()
label_values_df.head(10)

Unnamed: 0,label,value
0,renault scenic expression 1.6 16v,
1,ano,2007
2,preço,16.975
3,propulsão combustão,
4,combustível,flex
5,procedência,nacional
6,garantia,1 ano
7,configuração,minivan
8,porte,médio
9,5,lugares


Quite good but.. did you notice that "lugares" (places) is being treated as a value instead of a label when there are 5 places ('5', 'lugares')?
Let's refine our list..

# Filtering found labels

It is reasonable to assume that labels, unlike values, should not contain numbers, so let's remove them.

In [21]:
# we choose to keep only the label words that do not contain numbers

def remove_numbers(text):
    """Drop every word that contains at least one digit."""
    words = text.split()
    # \D* only matches a whole word when none of its characters is a digit
    filtered_words = [word for word in words if re.fullmatch(r'\D*', word)]
    return ' '.join(filtered_words)
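
A quick stand-alone check of this filter on labels taken from the tables in this article (the function is repeated here, in a compact form, only so that the snippet runs on its own):

```python
import re

def remove_numbers(text):
    """Keep only the words that contain no digit."""
    return " ".join(word for word in text.split() if re.fullmatch(r"\D*", word))

print(remove_numbers("renault scenic expression 1.6 16v"))  # renault scenic expression
print(remove_numbers("rodoviário km/l"))                    # rodoviário km/l
print(repr(remove_numbers("5")))                            # ''
```

A label made only of numbers, such as the "5" of "5 lugares", becomes an empty string, which is why empty labels are filtered out in the next step.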


And keep only the relevant labels.. 

In [22]:
label_values_df['filtered_label'] = label_values_df['label'].apply(lambda x: remove_numbers(x))
label_values_df = label_values_df[label_values_df["filtered_label"] != "" ]  #.dropna(subset=['filtered_label'])
label_values_df[['label', 'filtered_label']]  .head(20)

Unnamed: 0,label,filtered_label
0,renault scenic expression 1.6 16v,renault scenic expression
1,ano,ano
2,preço,preço
3,propulsão combustão,propulsão combustão
4,combustível,combustível
5,procedência,procedência
6,garantia,garantia
7,configuração,configuração
8,porte,porte
10,portas,portas


Now let's order by their frequency (how often those labels appear in our data)

In [23]:
label_values_df["filtered_label_frequency"] = label_values_df.groupby('filtered_label')['filtered_label'].transform('count')
# one row per distinct label, most frequent first
label_freq_df = (label_values_df
                 .sort_values(by="filtered_label_frequency", ascending=False)
                 .drop_duplicates(subset="filtered_label", keep='first'))


In [24]:
pd.set_option('display.max_rows', 200)
label_freq_df[["filtered_label", "filtered_label_frequency"]].head(20)

Unnamed: 0,filtered_label,filtered_label_frequency
1117,equipamento série,383
578,altura flanco mm,46
1916,elemento elástico mola helicoidal,38
1125,equipamento opcional,37
1292,preço,23
1291,ano,23
1297,configuração,23
1288,rodoviária km,23
1286,urbana km,23
1284,rodoviário km/l,23


In [25]:
label_freq_df[["filtered_label", "filtered_label_frequency"]].tail(20)

Unnamed: 0,filtered_label,filtered_label_frequency
227,mm,2
226,ângulo central graus,2
1535,alimentação injeção direta indireta,2
1453,série,2
1752,honda fit lx at,2
783,tração integral sob demanda,2
1912,acoplamento embreagem dupla banhada óleo,2
2053,chery celer act,1
2229,graus,1
79,toyota hilux turbo cs,1


Sounds good but.. did you notice that "equipamento série" (from "Equipamento de série", meaning "Standard equipment") occurs _much more_ often than the rest of the labels? Also, we may need to apply a threshold to drop labels that occur rarely (and are likely irrelevant).
The first point is understandable, since a car has several pieces of standard equipment (as well as optional equipment), so we should treat those labels differently. What about treating each piece of equipment as a label of its own, with a "unit" value: 1 for standard equipment and 0.5 for optional equipment?
Ok, let's retrieve their labels (or sub-labels) then..

In [26]:
label_values_df = label_values_temp_df.copy()
st_equipment_labels_df =   label_values_df[label_values_df["label"] == "equipamento série" ].drop_duplicates(subset="value", keep='first').head(20)
st_equipment_labels_df

Unnamed: 0,label,value
238,equipamento série,freios abs
239,equipamento série,airbags frontais
240,equipamento série,airbags laterais
241,equipamento série,alarme antifurto perimétrico
242,equipamento série,alarme antifurto volumétrico
243,equipamento série,câmera traseira para manobras
245,equipamento série,encosto de cabeça para todos ocupantes
246,equipamento série,controle de estabilidade
247,equipamento série,controle de tração
248,equipamento série,faróis de xenônio


In [27]:
label_values_df = label_values_temp_df.copy()
optional_eq_labels_df =   label_values_df[label_values_df["label"] == "equipamento opcional" ].drop_duplicates(subset="value", keep='first').head(20)
optional_eq_labels_df

Unnamed: 0,label,value
905,equipamento opcional,alarme antifurto perimétrico
906,equipamento opcional,alarme antifurto volumétrico
913,equipamento opcional,sensores de estacionamento traseiro
914,equipamento opcional,
916,equipamento opcional,ar quente
920,equipamento opcional,controle elétrico dos vidros traseiros
921,equipamento opcional,ajuste elétrico dos retrovisores
922,equipamento opcional,rodas de liga leve
927,equipamento opcional,cd player
928,equipamento opcional,conexão usb
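
The 1 / 0.5 encoding proposed above for equipment lines can be sketched as follows. This is a simplified illustration, not the code used later in the article: the names `EQUIPMENT_SCORES` and `encode_equipment` are hypothetical, and the sketch assumes the marker always comes at the start of an already cleaned, lowercased line.

```python
EQUIPMENT_SCORES = {
    "equipamento série": 1.0,     # standard equipment
    "equipamento opcional": 0.5,  # optional equipment
}

def encode_equipment(cleaned_line):
    """Return (item, score) for an equipment line, or None for any other line."""
    for prefix, score in EQUIPMENT_SCORES.items():
        if cleaned_line.startswith(prefix):
            # what follows the marker is the equipment itself, used as the label
            return cleaned_line[len(prefix):].strip(), score
    return None

print(encode_equipment("equipamento série freios abs"))      # ('freios abs', 1.0)
print(encode_equipment("equipamento opcional conexão usb"))  # ('conexão usb', 0.5)
print(encode_equipment("ano 2007"))                          # None
```

Each piece of equipment then becomes a column of its own, and the cell simply records whether the car has it as standard (1), as an option (0.5) or not at all (empty).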


Ok, now let's put all our potential labels together and remove the duplicates.

In [28]:
all_labels_list = st_equipment_labels_df["value"].tolist() + optional_eq_labels_df["value"].tolist() + label_freq_df["filtered_label"].tolist()
all_labels_list_df = pd.DataFrame(all_labels_list, columns=["label"])
all_labels_list_df.head(10)

Unnamed: 0,label
0,freios abs
1,airbags frontais
2,airbags laterais
3,alarme antifurto perimétrico
4,alarme antifurto volumétrico
5,câmera traseira para manobras
6,encosto de cabeça para todos ocupantes
7,controle de estabilidade
8,controle de tração
9,faróis de xenônio


In [29]:
all_labels_list_df.tail(10)

Unnamed: 0,label
160,volkswagen amarok extreme at cd
161,comando válvulas único
162,chevrolet corsa sedan joy
163,equipamento série motorista
164,chevrolet high country turbo at cd
165,volvo
166,plataforma ford
167,"indep., mcpherson"
168,cadillac srx
169,renault scenic expression


Oh yes, we forgot: we should drop the infrequent labels!

In [30]:
all_labels_list = st_equipment_labels_df["value"].tolist() + optional_eq_labels_df["value"].tolist() + label_freq_df[label_freq_df["filtered_label_frequency"] > 1]["filtered_label"].tolist()
all_labels_list_df = pd.DataFrame(all_labels_list, columns=["label"])
all_labels_list_df = all_labels_list_df[all_labels_list_df['label'] != "" ].drop_duplicates()
pd.set_option('display.max_rows', 40)
all_labels_list_df

Unnamed: 0,label
0,freios abs
1,airbags frontais
2,airbags laterais
3,alarme antifurto perimétrico
4,alarme antifurto volumétrico
...,...
152,alimentação injeção direta indireta
153,série
154,honda fit lx at
155,tração integral sob demanda
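
The frequency filter used above (keep a label only if it appears more than once) can also be sketched without pandas, with a `Counter`. The short list of labels below is a made-up sample, not the real output of the first round.

```python
from collections import Counter

# hypothetical sample of labels found during the first round
labels = ["ano", "preço", "ano", "preço", "chery celer act", "ano"]

label_counts = Counter(labels)
# most_common() sorts by descending frequency; keep what appears more than once
frequent_labels = [label for label, count in label_counts.most_common() if count > 1]
print(frequent_labels)  # ['ano', 'preço']
```

Labels that appear a single time are mostly car names such as "chery celer act", which is exactly what we want to drop from the table columns.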


# Filling the table 

Now that we have the labels, let's build our data frame and fill the table with the technical specs of every car.

First, let's create the dataframe.

In [37]:
labels_list = all_labels_list_df["label"].tolist()
labels_list.append("file_name")
cars_specs_df = pd.DataFrame(columns=labels_list )
cars_specs_df

Unnamed: 0,freios abs,airbags frontais,airbags laterais,alarme antifurto perimétrico,alarme antifurto volumétrico,câmera traseira para manobras,encosto de cabeça para todos ocupantes,controle de estabilidade,controle de tração,faróis de xenônio,...,equipamento série luz,chevrolet cs,mm,ângulo central graus,alimentação injeção direta indireta,série,honda fit lx at,tração integral sob demanda,acoplamento embreagem dupla banhada óleo,file_name


In [63]:
stop_words_pt = set(stopwords.words('portuguese'))
allowed_punct = set('/.,')  # the only punctuation marks a word may contain
# tokenize the labels into words once, outside the loops
label_words = [Counter(label.lower().split()) for label in labels_list]

for file_name in text_files:
    car_specs_dic = {key: None for key in labels_list}

    with open(f"{file_path}{file_name}", 'r') as f:
        content = f.read()
    try:
        # keep only the datasheet section of the page
        content = content.split('Ficha Técnica')[1].split('Fotos')[0]
    except IndexError:
        continue  # no 'Ficha Técnica' section in this file

    for line in content.split('\n'):
        # remove anything between square brackets
        line = re.sub(r'\[.*?\]', ' ', line)

        # drop words containing punctuation other than '/', '.' or ','
        words = [word for word in line.split()
                 if all(c in allowed_punct for c in word if c in string.punctuation)]

        # remove stopwords from the line
        words = [word for word in words if word not in stop_words_pt]
        cleaned_line = ' '.join(words).lower()

        # standard equipment counts as 1, optional equipment as 0.5
        equipment_value = ""
        if "equipamento série" in cleaned_line:
            equipment_value = "1"
            cleaned_line = cleaned_line.replace("equipamento série", "").strip()
        elif "equipamento opcional" in cleaned_line:
            equipment_value = "0.5"
            cleaned_line = cleaned_line.replace("equipamento opcional", "").strip()

        if equipment_value != "":
            if cleaned_line in labels_list:
                car_specs_dic[cleaned_line] = equipment_value
        else:
            # count the number of common words between the line and each label
            line_words = Counter(cleaned_line.split())
            common_counts = [sum((line_words & label_word).values()) for label_word in label_words]

            # find the label with the most matched words of the line
            max_index = common_counts.index(max(common_counts))
            most_matched_label = labels_list[max_index]

            # index 0 is also what index() returns when no label shares a word with the line
            if max_index > 0:
                # the value is what remains of the line once the label words are removed
                value_to_keep = cleaned_line
                for word in most_matched_label.split():
                    value_to_keep = value_to_keep.replace(word, '')
                    value_to_keep = value_to_keep.replace('  ', ' ').strip()
                car_specs_dic[most_matched_label] = value_to_keep

    # skip the files for which we found no information about the car
    if any(value is not None for value in car_specs_dic.values()):
        car_specs_dic['file_name'] = os.path.splitext(file_name)[0]
        cars_specs_df = pd.concat([cars_specs_df, pd.DataFrame([car_specs_dic])], ignore_index=True)

        print(car_specs_dic)


{'freios abs': None, 'airbags frontais': None, 'airbags laterais': None, 'alarme antifurto perimétrico': None, 'alarme antifurto volumétrico': None, 'câmera traseira para manobras': None, 'encosto de cabeça para todos ocupantes': None, 'controle de estabilidade': None, 'controle de tração': None, 'faróis de xenônio': None, 'faróis com regulagem de altura': None, 'faróis com refletores duplos': None, 'faróis de neblina': None, 'luz traseira de neblina': None, 'repetidores laterais das luzes de direção': None, 'travamento central das portas': None, 'luz de condução diurna': None, 'controle automático de descida': None, 'desembaçador do vidro traseiro': None, 'isofix para fixação de cadeira infantil': None, 'sensores de estacionamento traseiro': None, 'ar quente': None, 'controle elétrico dos vidros traseiros': None, 'ajuste elétrico dos retrovisores': None, 'rodas de liga leve': None, 'cd player': None, 'conexão usb': None, 'conexão bluetooth': None, 'volante multifuncional': None, 'comp

Unnamed: 0,freios abs,airbags frontais,airbags laterais,alarme antifurto perimétrico,alarme antifurto volumétrico,câmera traseira para manobras,encosto de cabeça para todos ocupantes,controle de estabilidade,controle de tração,faróis de xenônio,...,equipamento série luz,chevrolet cs,mm,ângulo central graus,alimentação injeção direta indireta,série,honda fit lx at,tração integral sob demanda,acoplamento embreagem dupla banhada óleo,file_name
0,,,,,,,,,,,...,,,,,,,,,,22397
1,,,,,,,,,dianteira,,...,,,,,,,,,,16273
2,,,,,,eixo rígido,,,,,...,,toyota hilux 2.5 turbo 4x4,,,,,,,,13249
3,1.0,1.0,1.0,1.0,1.0,eixo rígido,,,,,...,,,,23.0,,,,,,21550
4,,,,,,,,,,,...,,,,,,,,,,22013
5,,,,,,,,,,,...,,,,,,,,,,23487
6,,,,,,eixo torção,,,dianteira,,...,,,,,,,,,,14521
7,,,,,,eixo torção,,,dianteira,,...,,,,,,,,,,14910
8,,,,,,,,,,,...,,,,,,,,,,934
9,,,,,,eixo torção,,,dianteira,,...,,,,,,,1.4 8v,,,14426


In [65]:
cars_specs_df.tail()

Unnamed: 0,freios abs,airbags frontais,airbags laterais,alarme antifurto perimétrico,alarme antifurto volumétrico,câmera traseira para manobras,encosto de cabeça para todos ocupantes,controle de estabilidade,controle de tração,faróis de xenônio,...,equipamento série luz,chevrolet cs,mm,ângulo central graus,alimentação injeção direta indireta,série,honda fit lx at,tração integral sob demanda,acoplamento embreagem dupla banhada óleo,file_name
25,,,,,,,,,dianteira,,...,,,,,,,,,,3034
26,1.0,1.0,,1.0,,eixo torção,,,dianteira,,...,,,,,,,1.5,,,2039
27,1.0,1.0,1.0,1.0,1.0,,,,dianteira,,...,,,,,in,,,,,4564
28,1.0,1.0,,1.0,,eixo torção,,,dianteira,,...,,,,,,,,,,5563
29,,,,,,eixo rígido,,,,,...,,s10 de luxe 2.5 turbo,,,,,,,,15147
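
The label matching used to fill the table relies on the multiset intersection of `Counter` objects. The stand-alone sketch below isolates that step. The function name `best_matching_label` and the three labels are chosen for illustration, and, unlike the loop above, it tests the best overlap against zero explicitly instead of relying on the index.

```python
from collections import Counter

def best_matching_label(line, labels):
    """Return the label sharing the most words with the line, or None if none does."""
    line_words = Counter(line.lower().split())
    # "&" keeps, for each word, the smaller of the two counts (multiset intersection)
    overlaps = [sum((line_words & Counter(label.lower().split())).values())
                for label in labels]
    best = max(overlaps)
    if best == 0:
        return None
    return labels[overlaps.index(best)]

labels = ["ano", "preço", "altura flanco mm"]  # hypothetical subset of the columns
print(best_matching_label("ano 2007", labels))              # ano
print(best_matching_label("altura flanco 60 mm", labels))   # altura flanco mm
print(best_matching_label("procedência nacional", labels))  # None
print(best_matching_label("garantia 1 ano", labels))        # ano
```

The last call shows a limit of this approach: any shared word is enough to assign a line to a label, so "garantia 1 ano" lands in the "ano" column when no better label is available.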


# Final considerations

In this example, we explored how to turn a series of semi-structured text files into a structured table, using TF-IDF to find the labels and a simple count of common words to fill the table. Although these techniques work reasonably well in our case, they could be improved with a supervised RNN model that better determines the threshold used by the system. We will leave this for another time: after all, every article should reach an end.

Voilà!