
# Importing datasets

In [1]:
# https://drive.google.com/file/d/1pmNSD1nbYHEAiP065s4akRXHMWFs9Dqw/view?usp=sharing DBpedia train.csv
# https://drive.google.com/file/d/1mKededzdbJsWQnwsu-R_WSILYSvNEY7c/view?usp=sharing DBpedia test.csv
!pip install gdown
# the file ID is passed positionally; the --id flag was deprecated in gdown 4.3.1
!gdown 1pmNSD1nbYHEAiP065s4akRXHMWFs9Dqw --output train.csv  #download train.csv from Drive
!gdown 1mKededzdbJsWQnwsu-R_WSILYSvNEY7c --output test.csv   #download test.csv from Drive

Downloading...
From: https://drive.google.com/uc?id=1pmNSD1nbYHEAiP065s4akRXHMWFs9Dqw
To: /content/train.csv
100% 174M/174M [00:01<00:00, 156MB/s]
Downloading...
From: https://drive.google.com/uc?id=1mKededzdbJsWQnwsu-R_WSILYSvNEY7c
To: /content/test.csv
100% 21.8M/21.8M [00:00<00:00, 132MB/s]


In [2]:
import numpy as np
import pandas as pd

train_data = pd.read_csv('train.csv', encoding='utf8', header=None) #read train.csv into a dataframe (column 0 = label, 1 = title, 2 = description)
test_data = pd.read_csv('test.csv', encoding='utf8', header=None)   #read test.csv into a dataframe (same layout)

train_data.where(train_data[0] < 6, inplace = True)  #keep the first 5 categories (labels 1-5); all other rows become NaN
train_data = train_data[train_data[0].notnull()]     #drop the rows that were set to NaN

test_data.where(test_data[0] < 6, inplace = True)    #keep the first 5 categories (labels 1-5); all other rows become NaN
test_data = test_data[test_data[0].notnull()]        #drop the rows that were set to NaN
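
The `where` + `notnull` filter above inserts NaN into the label column, which is why the labels show up as floats (`1.0`, `5.0`) in the samples below. A minimal sketch of an equivalent boolean-mask filter that keeps the labels as integers, using a hypothetical toy frame `toy` with the same column layout:

```python
import pandas as pd

# Hypothetical toy frame with the same layout as the CSV files:
# column 0 = class label, column 1 = title, column 2 = description.
toy = pd.DataFrame({
    0: [1, 5, 6, 14, 3],
    1: ["a", "b", "c", "d", "e"],
    2: ["text a", "text b", "text c", "text d", "text e"],
})

# Keep only rows whose label is in the first five categories (1..5).
# No NaN is ever introduced, so the label column stays an integer column.
filtered = toy[toy[0] < 6]

print(filtered[0].tolist())
print(filtered[0].dtype)
```

With this variant the later `pd.to_numeric` call would be a no-op, since the labels are already numeric.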

In [3]:
train_data.sample(5) #sample from train data

index,0,1,2
694,1.0,CITS Group Corporation,The CITS Group Corporation (Chinese: 中国国旅集团有限...
187714,5.0,Mostafa Chamran,Mostafa Chamran Savei (8 March 1932 – 20 June...
134,1.0,Beyerdynamic,Beyerdynamic (stylized beyerdynamic) GmbH & C...
170358,5.0,Chris Koster,Chris Koster (born August 31 1964) is an Amer...
128179,4.0,Alexandre Picard (ice hockey),Alexandre Picard (born October 9 1985) is a C...


In [4]:
test_data.sample(5) #sample from test data

index,0,1,2
24380,5.0,Derek Hodge,Derek M. Hodge (October 5 1941 – May 31 2011)...
4372,1.0,Institute for OneWorld Health,OneWorld Health is a 501(c)(3) nonprofit drug...
6637,2.0,Episcopal School of Acadiana,Episcopal School of Acadiana (ESA) is a coedu...
6527,2.0,Hicksville High School (Ohio),Hicksville High School is a public high schoo...
12951,3.0,Burton Cummings,Burton L. Cummings OC OM (born December 31 19...


In [5]:
train_label = pd.to_numeric(train_data.iloc[:,0]) #select labels (numeric) from train data
train_text = train_data.iloc[:,1:3] #select title and description columns from train data

In [6]:
test_label = pd.to_numeric(test_data.iloc[:,0]) #select labels (numeric) from test data
test_text = test_data.iloc[:,1:3] #select title and description columns from test data

# Pre-processing

In [7]:
from sklearn.feature_extraction.text import CountVectorizer #vectorizer for article text data
from nltk.corpus import stopwords #stop-word lists (enough on a desktop where the corpus is already installed)
import nltk
nltk.download('stopwords') #download the stop-word corpus (needed in a fresh Colab notebook)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [8]:
vectorizer = CountVectorizer(stop_words=stopwords.words('english'), analyzer='word', ngram_range=(1, 1)) #vectorizer for words, removing stopwords
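
To make the vectorizer settings concrete, here is a small sketch on a hypothetical two-document corpus. It swaps the NLTK stop-word list for a tiny hand-written one (`toy_stop_words`) so that it runs without any download; everything else mirrors the call above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus and stop-word list, for illustration only.
docs = ["The school is a public high school", "The company is a public company"]
toy_stop_words = ["the", "is", "a"]

toy_vectorizer = CountVectorizer(stop_words=toy_stop_words, analyzer="word", ngram_range=(1, 1))
counts = toy_vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each kept token to its column index (columns follow alphabetical order).
print(sorted(toy_vectorizer.vocabulary_))
print(counts.toarray())
```

Each column counts how often one word occurs in a document; stop words never get a column.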

In [9]:
test_title = test_text.iloc[:,0] #select titles from test text
test_desc = test_text.iloc[:,1]  #select descriptions from test text

In [10]:
test_title = pd.DataFrame.sparse.from_spmatrix(vectorizer.fit_transform(test_title)) #vectorize test_title and wrap it in a sparse dataframe
test_title_feature_names = np.asarray(vectorizer.get_feature_names_out()) #feature names for test_title (get_feature_names() was removed in scikit-learn 1.2)
test_desc = pd.DataFrame.sparse.from_spmatrix(vectorizer.fit_transform(test_desc)) #vectorize test_desc and wrap it in a sparse dataframe
test_desc_feature_names = np.asarray(vectorizer.get_feature_names_out()) #feature names for test_desc

In [11]:
#First 5 rows and feature names obtained with the vectorizer
print(test_title[0:5])
print(test_title_feature_names[0:5])

   0      1      2      3      4      ...  25333  25334  25335  25336  25337
0      0      0      0      0      0  ...      0      0      0      0      0
1      0      0      0      0      0  ...      0      0      0      0      0
2      0      0      0      0      0  ...      0      0      0      0      0
3      0      0      0      0      0  ...      0      0      0      0      0
4      0      0      0      0      0  ...      0      0      0      0      0

[5 rows x 25338 columns]
['0verflow' '100' '102nd' '105th' '110th']


In [12]:
#First 5 rows and feature names obtained with the vectorizer
print(test_desc[0:5])
print(test_desc_feature_names[0:5])

   0      1      2      3      4      ...  77195  77196  77197  77198  77199
0      0      0      0      0      0  ...      0      0      0      0      0
1      0      0      0      0      0  ...      0      0      0      0      0
2      0      0      0      0      0  ...      0      0      0      0      0
3      0      0      0      0      0  ...      0      0      0      0      0
4      0      0      0      0      0  ...      0      0      0      0      0

[5 rows x 77200 columns]
['00' '000' '00027' '000338sz' '000629']


In [13]:
train_title = train_text.iloc[:,0] #select titles from train text
train_desc = train_text.iloc[:,1]  #select descriptions from train text

In [14]:
train_title = pd.DataFrame.sparse.from_spmatrix(vectorizer.fit_transform(train_title)) #vectorize train_title and wrap it in a sparse dataframe
train_title_feature_names = np.asarray(vectorizer.get_feature_names_out()) #feature names for train_title (get_feature_names() was removed in scikit-learn 1.2)
train_desc = pd.DataFrame.sparse.from_spmatrix(vectorizer.fit_transform(train_desc)) #vectorize train_desc and wrap it in a sparse dataframe
train_desc_feature_names = vectorizer.get_feature_names_out().tolist() #feature names for train_desc, kept as a plain list

In [15]:
#First 5 rows and feature names obtained with the vectorizer
print(train_title[0:5])
print(train_title_feature_names[0:5])

   0       1       2       3       ...  117693  117694  117695  117696
0       0       0       0       0  ...       0       0       0       0
1       0       0       0       0  ...       0       0       0       0
2       0       0       0       0  ...       0       0       0       0
3       0       0       0       0  ...       0       0       0       0
4       0       0       0       0  ...       0       0       0       0

[5 rows x 117697 columns]
['002' '05' '07' '09' '10']


In [16]:
#First 5 rows and feature names obtained with the vectorizer
print(train_desc[0:5])
print(train_desc_feature_names[0:5])

   0       1       2       3       ...  293357  293358  293359  293360
0       0       0       0       0  ...       0       0       0       0
1       0       0       0       0  ...       0       0       0       0
2       0       0       0       0  ...       0       0       0       0
3       0       0       0       0  ...       0       0       0       0
4       0       0       0       0  ...       0       0       0       0

[5 rows x 293361 columns]
['00', '000', '0000', '00000006', '000001']
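
Because `fit_transform` is called separately on each column of each split, the train and test matrices have different widths (for example 117697 vs 25338 title features), and column *i* does not mean the same word in both. A common alternative, sketched here on hypothetical toy titles, is to learn the vocabulary on the training data only and reuse it for the test data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the train and test title columns.
toy_train_titles = ["Burton Cummings", "Hicksville High School", "Episcopal School"]
toy_test_titles = ["Burton High School", "Derek Hodge"]

shared_vectorizer = CountVectorizer()
X_train = shared_vectorizer.fit_transform(toy_train_titles)  # learn the vocabulary from train only
X_test = shared_vectorizer.transform(toy_test_titles)        # reuse it; unseen words are ignored

# Both matrices now have the same number of columns, in the same order.
print(X_train.shape, X_test.shape)
```

This is only a sketch of the pattern; the notebook itself keeps four independently fitted vocabularies.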


In [17]:
from scipy.sparse import csr_matrix #for min-max scaling sparse matrices

In [18]:
def normalize(df): #function for min-max scaling of dataframes, one column at a time
    result = df.copy()
    for feature_name in df.columns:
        column = csr_matrix(df[feature_name]) #sparse view of the column, built once
        max_value = column.max()
        min_value = column.min()
        if max_value == min_value: #constant column: leave it as is to avoid dividing by zero
            continue
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
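
Looping over hundreds of thousands of columns in Python can be slow. For count features whose column minimum is 0 (true whenever at least one document lacks the word), min-max scaling reduces to dividing each column by its maximum. A minimal sketch of that shortcut, using scikit-learn's `MaxAbsScaler` directly on a hypothetical sparse matrix `toy_counts` rather than on a dataframe:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical count matrix: 3 documents x 3 terms.
toy_counts = csr_matrix(np.array([[0, 2, 1],
                                  [4, 0, 1],
                                  [0, 1, 0]], dtype=float))

# MaxAbsScaler divides each column by its largest absolute value
# and keeps the result sparse.
scaled = MaxAbsScaler().fit_transform(toy_counts)

print(scaled.toarray())
```

This is a swapped-in technique, not the `normalize` function above; the two agree only for columns that contain at least one zero.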

In [19]:
test_title = normalize(test_title) #min-max scale test_title

In [20]:
print(test_title[0:5])

   0      1      2      3      4      ...  25333  25334  25335  25336  25337
0    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
1    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
2    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
3    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
4    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0

[5 rows x 25338 columns]


In [21]:
test_desc = normalize(test_desc) #min-max scale test_desc

In [22]:
print(test_desc[0:5])

   0      1      2      3      4      ...  77195  77196  77197  77198  77199
0    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
1    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
2    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
3    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
4    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0

[5 rows x 77200 columns]


In [23]:
train_title = normalize(train_title) #min-max scale train_title

In [24]:
print(train_title[0:5])

   0       1       2       3       ...  117693  117694  117695  117696
0     0.0     0.0     0.0     0.0  ...     0.0     0.0     0.0     0.0
1     0.0     0.0     0.0     0.0  ...     0.0     0.0     0.0     0.0
2     0.0     0.0     0.0     0.0  ...     0.0     0.0     0.0     0.0
3     0.0     0.0     0.0     0.0  ...     0.0     0.0     0.0     0.0
4     0.0     0.0     0.0     0.0  ...     0.0     0.0     0.0     0.0

[5 rows x 117697 columns]


In [25]:
train_desc = normalize(train_desc) #min-max scale train_desc

In [26]:
print(train_desc[0:5])

   0       1       2       3       ...  293357  293358  293359  293360
0     0.0     0.0     0.0     0.0  ...     0.0     0.0     0.0     0.0
1     0.0     0.0     0.0     0.0  ...     0.0     0.0     0.0     0.0
2     0.0     0.0     0.0     0.0  ...     0.0     0.0     0.0     0.0
3     0.0     0.0     0.0     0.0  ...     0.0     0.0     0.0     0.0
4     0.0     0.0     0.0     0.0  ...     0.0     0.0     0.0     0.0

[5 rows x 293361 columns]


Our **X** will be the sparse matrices, together with their feature names, built from the titles and the descriptions; **Y** will be the labels provided in the CSV files.
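
One possible way to assemble such an **X** is to place the title and description features side by side, one row per article. This is a sketch under the assumption that a single combined matrix is wanted; the toy matrices and labels are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical vectorized titles (2 features) and descriptions (3 features) for 2 articles.
toy_title = csr_matrix(np.array([[1, 0], [0, 1]]))
toy_desc = csr_matrix(np.array([[0, 2, 1], [1, 0, 0]]))
toy_labels = np.array([1, 5])

# Stack the two feature blocks horizontally; the result stays sparse.
X = hstack([toy_title, toy_desc]).tocsr()
Y = toy_labels

print(X.shape, Y.shape)
```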

The validation split is taken from the training data later, in the model.fit() call.
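
Assuming `model.fit()` here is the Keras method and the split is requested through its `validation_split` argument (an assumption, since the model is not shown in this notebook), Keras holds out the last fraction of the samples before any shuffling. The hypothetical helper `holdout_tail` below mimics that behaviour with plain NumPy to show which rows would end up in the validation set:

```python
import numpy as np

def holdout_tail(x, y, validation_split):
    """Hypothetical helper: hold out the last fraction of the samples for validation."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Hypothetical data: 8 samples with one feature each.
toy_x = np.arange(8).reshape(8, 1)
toy_y = np.arange(8)

(train_x, train_y), (val_x, val_y) = holdout_tail(toy_x, toy_y, validation_split=0.25)
print(len(train_x), len(val_x))
```

If the rows are ordered by class, a tail split like this can leave some classes out of the training part, so shuffling the training data beforehand may be worth considering.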