As part of the **Amazon ML Challenge 2023**, my team and I developed a machine learning model to predict product length from catalog metadata. The objective of the challenge was to facilitate efficient packaging and storage of products in the warehouse.

The training data consisted of about 2.2 million products, each with a unique product ID, title, description, bullet points, product type ID, and product length; the test data covered a further 734,736 products without the length. Our task was to build a model that could accurately predict the product length from these metadata features, despite the noise in the data.

To evaluate our models, we used the root mean square error (RMSE) together with the R-squared score.
For submission, we created a .csv file with the index set as the product ID and the target variable as the predicted product length. The submission file had to be of size 734736 x 2 and contain the correct index values and column names as provided in the sample submission file.
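
A minimal sketch of how a submission file of this shape could be assembled with pandas. The helper name `build_submission`, the column names, and the example values are assumptions for illustration; the real column names come from the sample submission file.

```python
import numpy as np
import pandas as pd

def build_submission(product_ids, predicted_lengths):
    """Return a two-column frame: PRODUCT_ID and PRODUCT_LENGTH (hypothetical helper)."""
    submission = pd.DataFrame({
        "PRODUCT_ID": np.asarray(product_ids),
        "PRODUCT_LENGTH": np.asarray(predicted_lengths, dtype=float),
    })
    # Every product should appear exactly once
    if submission["PRODUCT_ID"].duplicated().any():
        raise ValueError("duplicate PRODUCT_ID in submission")
    return submission

# Placeholder IDs and predictions, for illustration only
demo = build_submission([604373, 1729783, 1871949], [500.0, 750.5, 1200.0])
csv_text = demo.to_csv(index=False)  # pass a file path instead to write the file
```

Checking `demo.shape` against the expected number of rows before uploading is a cheap way to catch dropped or duplicated products.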

Link to the hackathon:
https://www.hackerearth.com/challenges/competitive/amazon-ml-challenge-2023/machine-learning/product-length-prediction-7-85b7ef50/

Overall, this challenge provided me with the opportunity to develop my machine learning skills and apply them to a real-world problem faced by an industry-leading company like Amazon.


#libraries and dataset

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import re

In [None]:
!wget https://s3-ap-southeast-1.amazonaws.com/he-public-data/datasetb2d9982.zip

--2023-04-23 19:02:25--  https://s3-ap-southeast-1.amazonaws.com/he-public-data/datasetb2d9982.zip
Resolving s3-ap-southeast-1.amazonaws.com (s3-ap-southeast-1.amazonaws.com)... 52.219.132.10
Connecting to s3-ap-southeast-1.amazonaws.com (s3-ap-southeast-1.amazonaws.com)|52.219.132.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 895569552 (854M) [binary/octet-stream]
Saving to: ‘datasetb2d9982.zip.1’


2023-04-23 19:03:14 (17.8 MB/s) - ‘datasetb2d9982.zip.1’ saved [895569552/895569552]



In [None]:
!unzip datasetb2d9982.zip.1

Archive:  datasetb2d9982.zip.1
   creating: dataset/
  inflating: dataset/sample_submission.csv  
  inflating: dataset/train.csv       
  inflating: dataset/test.csv        


In [None]:
train_df = pd.read_csv('dataset/train.csv')
test_df = pd.read_csv('dataset/test.csv')

#understanding

In [None]:
train_df.shape

(2249698, 6)

In [None]:
test_df.shape

(734736, 5)

In [None]:
train_df.head(5)

Unnamed: 0,PRODUCT_ID,TITLE,BULLET_POINTS,DESCRIPTION,PRODUCT_TYPE_ID,PRODUCT_LENGTH
0,1925202,ArtzFolio Tulip Flowers Blackout Curtain for D...,[LUXURIOUS & APPEALING: Beautiful custom-made ...,,1650,2125.98
1,2673191,Marks & Spencer Girls' Pyjama Sets T86_2561C_N...,"[Harry Potter Hedwig Pyjamas (6-16 Yrs),100% c...",,2755,393.7
2,2765088,PRIKNIK Horn Red Electric Air Horn Compressor ...,"[Loud Dual Tone Trumpet Horn, Compatible With ...","Specifications: Color: Red, Material: Aluminiu...",7537,748.031495
3,1594019,ALISHAH Women's Cotton Ankle Length Leggings C...,[Made By 95%cotton and 5% Lycra which gives yo...,AISHAH Women's Lycra Cotton Ankel Leggings. Br...,2996,787.401574
4,283658,The United Empire Loyalists: A Chronicle of th...,,,6112,598.424


In [None]:
test_df.head(5)

Unnamed: 0,PRODUCT_ID,TITLE,BULLET_POINTS,DESCRIPTION,PRODUCT_TYPE_ID
0,604373,Manuel d'Héliogravure Et de Photogravure En Re...,,,6142
1,1729783,DCGARING Microfiber Throw Blanket Warm Fuzzy P...,[QUALITY GUARANTEED: Luxury cozy plush polyest...,<b>DCGARING Throw Blanket</b><br><br> <b>Size ...,1622
2,1871949,I-Match Auto Parts Front License Plate Bracket...,"[Front License Plate Bracket Made Of Plastic,D...",Replacement for The Following Vehicles:2020 LE...,7540
3,1107571,PinMart Gold Plated Excellence in Service 1 Ye...,[Available as a single item or bulk packed. Se...,Our Excellence in Service Lapel Pins feature a...,12442
4,624253,"Visual Mathematics, Illustrated by the TI-92 a...",,,6318


#steps performed everytime


In [None]:
##imp step
train_df = train_df.sample(n=20000, random_state=123, replace = True)
#test_df = test_df.sample(n=20000, random_state=123, replace = True)


In [None]:
train_df = train_df[train_df['PRODUCT_LENGTH'] >= 0]
train_df['PRODUCT_LENGTH'] = np.log1p(train_df['PRODUCT_LENGTH'])
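
Because the target is trained on a `log1p` scale, predictions have to be mapped back with `expm1` before they are reported as lengths. A small sketch with made-up values:

```python
import numpy as np

lengths = np.array([0.0, 393.7, 2125.98])   # example raw lengths
log_lengths = np.log1p(lengths)             # scale the model is trained on
restored = np.expm1(log_lengths)            # inverse transform for predictions
```

Any RMSE computed on the transformed target is in log units, so it is not directly comparable to an RMSE on raw lengths.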

In [None]:
#Handle missing values
train_df.fillna("unknown", inplace=True)
#test_df.fillna("unknown", inplace=True)

In [None]:
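#Combine relevant text data ('TITLE', 'DESCRIPTION', 'BULLET_POINTS') from the train data
combined_text = train_df['TITLE'] + " " + train_df['DESCRIPTION'] + " " + train_df['BULLET_POINTS']
#Later cells refer to this series as combined_text_train, so keep both names
combined_text_train = combined_text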
#combined_text_test = test_df['TITLE'] + " " + test_df['DESCRIPTION'] + " " + test_df['BULLET_POINTS']

#word2vec

In [None]:
from gensim.models import Word2Vec
import nltk
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Tokenize the text
tokenized_text = [nltk.word_tokenize(text) for text in combined_text_train]

# Train the Word2Vec model
model = Word2Vec(tokenized_text, min_count=1)

# Create an empty numpy array to hold the word vectors
word_vectors = np.zeros((len(tokenized_text), model.vector_size))

# Convert each document to a vector representation using the trained Word2Vec model
for i, doc in enumerate(tokenized_text):
    vector = np.zeros(model.vector_size)
    num_words = 0
    for word in doc:
        if word in model.wv:
            vector += model.wv[word]
            num_words += 1
    if num_words > 0:
        vector /= num_words
    word_vectors[i] = vector

# Split the data into training and testing sets
train_size = int(0.8 * len(word_vectors))
x_train = word_vectors[:train_size]
x_test = word_vectors[train_size:]
y_train = train_df['PRODUCT_LENGTH'][:train_size]
y_test = train_df['PRODUCT_LENGTH'][train_size:]

# Train a linear regression model on the training data
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

# Calculate the RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)

# Evaluate the model on the test set; for regressors, score() returns R-squared, not accuracy
accuracy = model.score(x_test, y_test)


X_train = x_train
X_test = x_test


NameError: ignored

In [None]:
print(accuracy)
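
The value printed as `accuracy` is the R-squared score, since `score()` on a scikit-learn regressor returns R-squared. A small sketch on synthetic data (the data and names below are illustrative assumptions, not the challenge data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

reg = LinearRegression().fit(X_demo[:160], y_demo[:160])
pred = reg.predict(X_demo[160:])

rmse_demo = np.sqrt(mean_squared_error(y_demo[160:], pred))
r2_demo = r2_score(y_demo[160:], pred)
score_demo = reg.score(X_demo[160:], y_demo[160:])  # same quantity as r2_demo
```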

#doc2vec

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# Tokenize the combined text and create tagged documents
tagged_docs = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[i]) for i, doc in enumerate(combined_text_train)]

# Train the doc2vec model
model = Doc2Vec(tagged_docs, vector_size=100, window=5, min_count=5, epochs=20)

# Get document vectors for the training data
doc_vectors_train = [model.infer_vector(doc.words) for doc in tagged_docs]


In [None]:
import nltk

# Tokenize the text
tokenized_text = [nltk.word_tokenize(text) for text in combined_text_train]


In [None]:
import numpy as np

# Extract word vectors for all words in the training data
word_vectors_train = []
for sentence in tokenized_text:
    sentence_vectors = []
    for word in sentence:
        try:
            word_vector = model.wv[word]
            sentence_vectors.append(word_vector)
        except KeyError:
            # Ignore out-of-vocabulary words
            pass
    word_vectors_train.append(sentence_vectors)

# Pad or truncate all sentences to a fixed length
max_length = 50
x_train = np.zeros((len(word_vectors_train), max_length, model.vector_size))
for i, sentence_vectors in enumerate(word_vectors_train):
    # Truncate to max_length and skip empty sentences, which would otherwise fail to broadcast
    n = min(len(sentence_vectors), max_length)
    if n > 0:
        x_train[i, :n, :] = np.asarray(sentence_vectors[:n])


In [None]:
X_train = doc_vectors_train

In [None]:
y_train = train_df['PRODUCT_LENGTH'].values

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re

def preprocess_text(text):
    # Remove special characters and digits
    text = re.sub("[^a-zA-Z]", " ", text)
    
    # Convert to lower case
    text = text.lower()
    
    # Tokenize and lemmatize the text
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    text = " ".join([lemmatizer.lemmatize(token) for token in tokens])
    
    return text


In [None]:
#Define a function for text preprocessing (stemming variant; it overrides the lemmatizing version above)
def preprocess_text(text):
    # Remove special characters and digits
    text = re.sub("[^a-zA-Z]", " ", text)
    
    # Convert to lower case
    text = text.lower()
    
    # Tokenize and stem the text
    ps = PorterStemmer()
    tokens = word_tokenize(text)
    text = " ".join([ps.stem(token) for token in tokens])
    
    return text

In [None]:
import nltk

In [None]:
nltk.download('punkt')
nltk.download('wordnet')


In [None]:
#Preprocess the combined text data
preprocessed_text_train = combined_text_train.apply(preprocess_text)
#preprocessed_text_test = combined_text_test.apply(preprocess_text)

In [None]:
#Vectorize the preprocessed text using TF-IDF vectorization
vectorizer = TfidfVectorizer(max_features=10000)
X_train_text = vectorizer.fit_transform(preprocessed_text_train)
#X_test_text = vectorizer.transform(preprocessed_text_test)  # reuse the vocabulary fitted on train
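
A short sketch of why the test text should go through `transform` rather than `fit_transform`: the vocabulary and IDF weights are learned once on the training text and then reused, so both matrices share the same columns. The example sentences are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["blackout curtain for door", "cotton ankle length leggings"]
test_texts = ["microfiber throw blanket", "cotton curtain"]

tfidf = TfidfVectorizer()
X_tr = tfidf.fit_transform(train_texts)  # learns vocabulary and IDF weights
X_te = tfidf.transform(test_texts)       # reuses them; unseen words are dropped
```

Refitting on the test text would produce a different vocabulary, and a model trained on `X_tr` could not be applied to the result.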


In [None]:
X_train = X_train_text.toarray()
#X_test_text_array = X_test_text.toarray()


In [None]:
type(X_train)

In [None]:
y_train = train_df['PRODUCT_LENGTH'].values

In [None]:
type(y_train)

In [None]:
import numpy as np

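# Dense copy of the TF-IDF matrix; later cells refer to it as X_train_text_array
X_train_text_array = X_train_text.toarray()
# flatten() is an ndarray method; np.flatten does not exist
input_array_flattened = X_train_text_array.flatten()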


In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
product_type_id_encoded = encoder.fit_transform(train_df['PRODUCT_TYPE_ID'])


In [None]:
product_type_id_encoded_2d = product_type_id_encoded.reshape(-1, 1)
print(product_type_id_encoded_2d.shape)
print(X_train_text_array.shape)


In [None]:
from scipy.sparse import hstack

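from scipy.sparse import csr_matrix

# Concatenate product_type_id_encoded and the sparse TF-IDF matrix column-wise (no transpose: both have one row per product)
X_train = hstack((csr_matrix(product_type_id_encoded.reshape(-1, 1)), X_train_text))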

# Convert X_train to a dense numpy array
X_train = X_train.toarray()

# Reshape y_train to a 2D numpy array
y_train = train_df['PRODUCT_LENGTH'].values.reshape(-1, 1)


In [None]:
from scipy.sparse import hstack

# Concatenate X_train_text_array and product_type_id_encoded
X_train_encoded = hstack((X_train_text_array, product_type_id_encoded.reshape(-1, 1)))

# Convert X_train_encoded to a dense numpy array
X_train = X_train_encoded.toarray()

y_train = train_df['PRODUCT_LENGTH'].values


In [None]:
from scipy.sparse import hstack

# Concatenate X_train_text_array and product_type_id_encoded
X_train = hstack((X_train_text_array, product_type_id_encoded.reshape(-1, 1)))

# Convert X_train to a dense numpy array
X_train = X_train.toarray()
y_train = train_df['PRODUCT_LENGTH'].values
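
Label-encoded IDs are read by a linear model as ordered numbers, which product type IDs are not. One alternative, sketched here as an assumption rather than what the notebook did, is to one-hot encode the IDs and keep everything sparse when stacking. The values below are toy data.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the TF-IDF matrix and the PRODUCT_TYPE_ID column
text_features = csr_matrix(np.array([[0.5, 0.0], [0.0, 1.0], [0.2, 0.3]]))
product_type_ids = np.array([[1650], [2755], [1650]])

onehot = OneHotEncoder(handle_unknown="ignore")
type_features = onehot.fit_transform(product_type_ids)  # sparse, one column per ID

# Column-wise concatenation without converting to a dense array
X_combined = hstack([text_features, type_features]).tocsr()
```

Staying sparse avoids the `toarray()` call, which is the step most likely to exhaust memory on the full dataset.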

In [None]:
# Prepare the input data for the machine learning model
X_train = pd.concat([pd.DataFrame(X_train_text.toarray()), train_data['PRODUCT_TYPE_ID'].reset_index(drop=True)], axis=1)
X_val = pd.concat([pd.DataFrame(X_val_text.toarray()), validation_data['PRODUCT_TYPE_ID'].reset_index(drop=True)], axis=1)
y_train = train_data['PRODUCT_LENGTH']
y_val = validation_data['PRODUCT_LENGTH']

In [None]:
from sklearn.linear_model import LinearRegression

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)


In [None]:
# Evaluate the model on the test set (predict first, then score)
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)  # R-squared for regressors

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("RMSE:", rmse)


#TF-IDF

In [None]:
from sklearn.model_selection import train_test_split

train_set, val_set = train_test_split(train_df, test_size=0.2, random_state=123)

In [None]:
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_set['TITLE'].fillna('') + ' ' + train_set['DESCRIPTION'].fillna('') + ' ' + train_set['BULLET_POINTS'].fillna(''))
val_features = vectorizer.transform(val_set['TITLE'].fillna('') + ' ' + val_set['DESCRIPTION'].fillna('') + ' ' + val_set['BULLET_POINTS'].fillna(''))

In [None]:
# define target variable
target_col = 'PRODUCT_LENGTH'

# extract x_train and y_train
X_train = train_features
y_train = train_set[target_col]

# extract x_test and y_test
X_test = val_features
y_test = val_set[target_col]

#tf idf try

In [None]:
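# Split the data into training and validation sets
train_df, test_df = train_test_split(train_df, test_size=0.2, random_state=42)

# Rebuild the combined text for each split so the two matrices cover different rows
train_text = train_df['TITLE'] + " " + train_df['DESCRIPTION'] + " " + train_df['BULLET_POINTS']
test_text = test_df['TITLE'] + " " + test_df['DESCRIPTION'] + " " + test_df['BULLET_POINTS']

# Vectorize the text using TF-IDF: fit on the training split only, then reuse the vocabulary
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)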

In [None]:
X_train.toarray()
X_test.toarray()

In [None]:
#first split the data into training and validation sets and then develop a machine learning model to predict the product length dimension. 

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression

# Load the training data
data = pd.read_csv('dataset/preprocessed_train.csv')

##imp step
data = data.sample(n=20000, random_state=123, replace = True)

# Preprocess the data: combine TITLE, DESCRIPTION, and BULLET_POINTS into a single text feature
data['combined_text'] = data['TITLE'].fillna('') + ' ' + data['DESCRIPTION'].fillna('') + ' ' + data['BULLET_POINTS'].fillna('')
data = data[['combined_text', 'PRODUCT_TYPE_ID', 'PRODUCT_LENGTH']]

# Split the data into training and validation sets
train_data, validation_data = train_test_split(data, test_size=0.2, random_state=42)

# Vectorize the combined_text column using TF-IDF
vectorizer = TfidfVectorizer(max_features=1000)
X_train_text = vectorizer.fit_transform(train_data['combined_text'])
X_val_text = vectorizer.transform(validation_data['combined_text'])

# Prepare the input data for the machine learning model
X_train = pd.concat([pd.DataFrame(X_train_text.toarray()), train_data['PRODUCT_TYPE_ID'].reset_index(drop=True)], axis=1)
X_val = pd.concat([pd.DataFrame(X_val_text.toarray()), validation_data['PRODUCT_TYPE_ID'].reset_index(drop=True)], axis=1)
y_train = train_data['PRODUCT_LENGTH']
y_val = validation_data['PRODUCT_LENGTH']

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Validate the model on the validation set
y_pred_val = model.predict(X_val)

#calculate the R-squared score (model.score returns R-squared for regressors, not accuracy)
accuracy = model.score(X_val, y_val)
print("accuracy:", accuracy)

#Concatenation skip

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

# Extract the vectors from column A into a numpy array
X_a = np.stack(df['A'].values)

# Extract the integer values from column B into a numpy array
X_b = df['B'].values.reshape(-1, 1)

# Concatenate the two numpy arrays along axis 1 to create the final feature matrix
X_train = np.concatenate([X_a, X_b], axis=1)

# Extract the target variable into a numpy array
y_train = df['C'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)


#new start skip

In [None]:
#To preprocess the dataset efficiently and ensure the code runs on Google Colab without crashing, follow the steps below:

#1. Download the dataset.
#2. Load train.csv and test.csv files.
#3. Handle missing values.
#4. Remove noise.
#5. Normalize the data.

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import TfidfVectorizer

# Download and unzip the dataset
#!wget https://s3-ap-southeast-1.amazonaws.com/he-public-data/datasetb2d9982.zip
#!unzip datasetb2d9982.zip

# Load the train.csv and test.csv files
train_data = pd.read_csv("dataset/train.csv", index_col="PRODUCT_ID")
test_data = pd.read_csv("dataset/test.csv", index_col="PRODUCT_ID")

# Combine the text columns into a single column
train_data["combined_text"] = train_data["TITLE"].fillna('') + ' ' + train_data["DESCRIPTION"].fillna('') + ' ' + train_data["BULLET_POINTS"].fillna('')
test_data["combined_text"] = test_data["TITLE"].fillna('') + ' ' + test_data["DESCRIPTION"].fillna('') + ' ' + test_data["BULLET_POINTS"].fillna('')

# Drop the individual text columns
train_data.drop(["TITLE", "DESCRIPTION", "BULLET_POINTS"], axis=1, inplace=True)
test_data.drop(["TITLE", "DESCRIPTION", "BULLET_POINTS"], axis=1, inplace=True)

# Fill missing values in PRODUCT_TYPE_ID using SimpleImputer (fit on train only, reuse on test)
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
train_data["PRODUCT_TYPE_ID"] = imputer.fit_transform(train_data[["PRODUCT_TYPE_ID"]]).ravel()
test_data["PRODUCT_TYPE_ID"] = imputer.transform(test_data[["PRODUCT_TYPE_ID"]]).ravel()

In [None]:
#dropping missing values
train_df.dropna(inplace=True)
print(train_df.isnull().sum())
train_df.info()
unique_product_df = train_df.TITLE.unique().tolist()
print(len(unique_product_df))

#run for now (vectorization for title, description, bullet **separately**) skip

In [None]:
#To perform feature engineering on the given dataset, we will do the following steps:

#1. Import necessary libraries
#2. Load the training dataset
#3. Preprocess text data
#4. Vectorize text data using TF-IDF
#5. Encode categorical data
#6. Combine features

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import hstack

# Load the training dataset
train_dataa = pd.read_csv('dataset/preprocessed_train.csv')

# Preprocess text data: lower-case, then drop everything except letters and whitespace
def preprocess_text(text):
    return text.str.lower().str.replace(r'[^a-z\s]', '', regex=True)

train_dataa['TITLE'] = preprocess_text(train_dataa['TITLE'])
train_dataa['DESCRIPTION'] = preprocess_text(train_dataa['DESCRIPTION'])
train_dataa['BULLET_POINTS'] = preprocess_text(train_dataa['BULLET_POINTS'])

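# Vectorize text data using TF-IDF, with one vectorizer per column so each fitted vocabulary is kept
title_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
description_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
bullet_points_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
title_matrix = title_vectorizer.fit_transform(train_dataa['TITLE'])
description_matrix = description_vectorizer.fit_transform(train_dataa['DESCRIPTION'])
bullet_points_matrix = bullet_points_vectorizer.fit_transform(train_dataa['BULLET_POINTS'])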

# Encode categorical data
encoder = LabelEncoder()
product_type_id_encoded = encoder.fit_transform(train_dataa['PRODUCT_TYPE_ID'])

# Combine features into one sparse matrix (pd.DataFrame cannot take several matrices as positional arguments)
from scipy.sparse import csr_matrix
X_train = hstack((title_matrix, description_matrix, bullet_points_matrix, csr_matrix(product_type_id_encoded.reshape(-1, 1)))).tocsr()
y_train = train_dataa['PRODUCT_LENGTH'].values

print("Features have been engineered.")

type(X_train)
type(y_train)

#Word2Vec (feature matrix include product_type_id)

In [None]:
#info
#a = np.array([[1, 2], [3, 4]])
#b = np.array([[5, 6], [7,8]])
#np.concatenate((a, b), axis=0)#along row
#np.concatenate((a, b), axis=1)#along column #horizontal stacking


In [None]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from sklearn.preprocessing import LabelEncoder


In [None]:
train_data = pd.read_csv('dataset/preprocessed_train.csv')

In [None]:
train_data.head()

In [None]:
train_data = train_data.sample(n=20000, random_state=123, replace = True)


In [None]:
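def preprocess_text(text):
    # regex=True is required in recent pandas versions for the pattern to be treated as a regex
    return text.str.lower().str.replace(r'[^a-z\s]', '', regex=True)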

train_data['TITLE'] = preprocess_text(train_data['TITLE'])
train_data['DESCRIPTION'] = preprocess_text(train_data['DESCRIPTION'])
train_data['BULLET_POINTS'] = preprocess_text(train_data['BULLET_POINTS'])

print("punctuation removal and lower case conversion")

In [None]:
# Get the first half of the data
#half_data = train_data.iloc[:len(train_data)//2]

# Create a list of all the sentences in the selected columns
sentences = train_data['TITLE'].apply(lambda x: x.split()).tolist() + train_data['DESCRIPTION'].apply(lambda x: x.split()).tolist() + train_data['BULLET_POINTS'].apply(lambda x: x.split()).tolist()

# Train a Word2Vec model on the selected sentences
model = Word2Vec(sentences, min_count=1)


In [None]:
model

In [None]:
encoder = LabelEncoder()
product_type_id_encoded = encoder.fit_transform(train_data['PRODUCT_TYPE_ID'])


In [None]:
X_train = []
for i in range(train_data.shape[0]):
    # Use train_data, the frame the Word2Vec model was trained on
    row = train_data.iloc[i]
    words = row['TITLE'].split() + row['DESCRIPTION'].split() + row['BULLET_POINTS'].split()
    # Keep only in-vocabulary words; fall back to a zero vector for empty rows
    vectors = [model.wv[word] for word in words if word in model.wv]
    features = np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)
    #X_train.append(np.concatenate((features, [product_type_id_encoded[i]])))
    X_train.append(features)
X_train = np.array(X_train)
y_train = train_data['PRODUCT_LENGTH'].values
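
The averaging step can be sketched without gensim by using a plain dictionary as the embedding table. The helper `average_vector` and the two-dimensional toy embeddings are assumptions for illustration only.

```python
import numpy as np

def average_vector(tokens, embeddings, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy embedding table standing in for model.wv
toy_embeddings = {"cotton": np.array([1.0, 0.0]), "curtain": np.array([0.0, 1.0])}

doc_vec = average_vector(["cotton", "curtain", "zzz"], toy_embeddings, dim=2)
empty_vec = average_vector(["zzz"], toy_embeddings, dim=2)
```

Handling the empty case explicitly matters here because many products have no description or bullet points, so some rows may contain no known words at all.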


#save & load pickle file skip

In [None]:
import pickle

# Save the feature matrix and target as pickle files
with open('X.pickle', 'wb') as f:
    pickle.dump(X_train, f)

with open('y.pickle', 'wb') as f:
    pickle.dump(y_train, f)


In [None]:
import pickle

# Load X and y from pickle files
with open('X.pickle', 'rb') as f:
    X = pickle.load(f)

with open('y.pickle', 'rb') as f:
    y = pickle.load(f)

print(X)
print(y)

#this is working [1st pre processing] (doesnt yield any special result)

In [None]:
train_df = pd.read_csv('dataset/train.csv')


In [None]:
train_df.shape

In [None]:
train_df.head()

In [None]:
import pandas as pd
import numpy as np
import re

# Load the training data
train_df = pd.read_csv('dataset/train.csv')

# Remove duplicates based on TITLE
train_df.drop_duplicates(subset='TITLE', keep='first', inplace=True)

# Fill missing values with 'unknown'
train_df.fillna("unknown", inplace=True)

# Remove special characters and convert to lowercase
def preprocess_text(text):
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    return text

# Combine TITLE, DESCRIPTION, and BULLET_POINTS into a single text feature
train_df['text'] = train_df['TITLE'] + ' ' + train_df['DESCRIPTION'] + ' ' + train_df['BULLET_POINTS']

# Preprocess the text feature
train_df['text'] = train_df['text'].apply(preprocess_text)

# Save the preprocessed data to a new CSV file
train_df.to_csv('dataset/preprocessed_train.csv', index=False)

In [None]:
train_df.shape

In [None]:
train_df.head()

In [None]:
train_df.columns

#this part is crashing skip

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer


# Define the TF-IDF vectorizer
#vectorizer = TfidfVectorizer(max_features=10000)

# Define the Count vectorizer
vectorizer = CountVectorizer(max_features=10000)

# Fit and transform the training data
X_train_text = vectorizer.fit_transform(train_df['text'])


#next part

In [None]:
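import pandas as pd
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer

# The 'text' column only exists in the preprocessed file written earlier
csv_path = 'dataset/preprocessed_train.csv'

# Define the TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=10000)

# First pass: fit one vocabulary on all rows, streaming the text chunk by chunk
vectorizer.fit(text for chunk in pd.read_csv(csv_path, chunksize=10000)
               for text in chunk['text'].fillna("unknown"))

# Second pass: transform each chunk with that shared vocabulary
X_train_list = []
for chunk in pd.read_csv(csv_path, chunksize=10000):
    X_chunk = vectorizer.transform(chunk['text'].fillna("unknown"))
    X_train_list.append(X_chunk)

# Concatenate the transformed chunks into a single sparse matrix
X_train = vstack(X_train_list)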


In [None]:
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)

# Split the data into smaller chunks
chunk_size = 10000
chunks = [train_df['text'][i:i+chunk_size] for i in range(0, len(train_df), chunk_size)]

# TfidfVectorizer has no partial_fit, so the vocabulary is fitted on the first chunk only
X_train = vectorizer.fit_transform(chunks[0])

# Transform the remaining chunks with the same vocabulary
for chunk in chunks[1:]:
    X_chunk = vectorizer.transform(chunk)
    X_train = vstack([X_train, X_chunk])
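
Since `TfidfVectorizer` cannot be fitted incrementally, one alternative for chunked processing is scikit-learn's stateless `HashingVectorizer`, followed by a `TfidfTransformer` on the stacked counts. This swaps in a different technique from the one used in this notebook; the texts and the feature count below are illustrative assumptions.

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

texts = ["blackout curtain", "cotton leggings", "throw blanket", "license plate bracket"]
chunks = [texts[0:2], texts[2:4]]  # stand-in for pd.read_csv(..., chunksize=...)

# Hashing needs no fitting, so every chunk maps to the same columns
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
counts = vstack([hasher.transform(chunk) for chunk in chunks])

# IDF weighting is computed once over all stacked rows
X_tfidf = TfidfTransformer().fit_transform(counts)
```

The trade-off is that hashed columns cannot be mapped back to words, and different words can collide in the same column.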


#working coz of chunks, but crashes sometimes too 

In [None]:
!pip install scipy

In [None]:
import pandas as pd
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=10000)

# Process the data in chunks
X_train_list = []
i=0
for chunk in pd.read_csv('dataset/preprocessed_train.csv', chunksize=10000):
    
    print(i)
    i= i+1
    # Fit the vocabulary on the first chunk only, then reuse it so every chunk has the same columns
    if not X_train_list:
        X_chunk = vectorizer.fit_transform(chunk['text'])
    else:
        X_chunk = vectorizer.transform(chunk['text'])
    X_train_list.append(X_chunk)
    print(i)

# Concatenate the transformed chunks into a single sparse matrix
X_train_text = vstack(X_train_list)

# Save the transformed data
X_train_text = pd.DataFrame(X_train_text.toarray())
X_train_text.to_csv('dataset/X_train_text.csv', index=False)

In [None]:
X_train_list

In [None]:
# Save the transformed data (X_train_text is already a DataFrame here, so no toarray() call is needed)
X_train_text.to_csv('X_train_text.csv', index=False)
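
Writing a dense copy of a large TF-IDF matrix to CSV is a likely cause of the crashes noted above. A sketch of an alternative, assuming the matrix is still in SciPy sparse form, is to store it with `scipy.sparse.save_npz`. The small matrix and the temporary path are placeholders.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small stand-in for the sparse TF-IDF matrix
matrix = sparse.csr_matrix(np.array([[0.0, 0.7], [0.3, 0.0]]))

path = os.path.join(tempfile.mkdtemp(), "X_train_text.npz")
sparse.save_npz(path, matrix)   # stores only the non-zero entries
loaded = sparse.load_npz(path)  # comes back in the same sparse format
```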

# linear regression model

In [None]:
import pandas as pd

X_train = pd.concat([pd.DataFrame(X.toarray()), train_df['PRODUCT_TYPE_ID'].reset_index(drop=True)], axis=1)
y_train = train_df['PRODUCT_LENGTH']


In [None]:
from sklearn.model_selection import train_test_split

# Split the feature matrix (not the raw text) into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.7, random_state=42)


In [None]:
from sklearn.linear_model import LinearRegression

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)


In [None]:
# Evaluate the model on the test set
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)


In [None]:
print(accuracy)

In [None]:
from sklearn.metrics import mean_squared_error

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)


In [None]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print('R-squared score:', r2)

#decision tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Train a decision tree regression model
model = DecisionTreeRegressor()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

#calculate the R-squared score (model.score returns R-squared for regressors, not accuracy)
accuracy = model.score(X_test, y_test)
print("accuracy:", accuracy)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("RMSE:", rmse)


In [None]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print('R-squared score:', r2)

#linear svr

In [None]:
from sklearn.svm import LinearSVR

# initialize the model
model = LinearSVR()

# fit the model on training data
model.fit(X_train, y_train)

# make predictions on test data
y_pred = model.predict(X_test)

# Evaluate the model on the test set (score() returns R-squared)
accuracy = model.score(X_test, y_test)
print(accuracy)

from sklearn.metrics import mean_squared_error

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)


#gradient booster

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# initialize the model
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)

# fit the model on training data
model.fit(X_train, y_train)

# make predictions on test data
y_pred = model.predict(X_test)

# Evaluate the model on the test set (score() returns R-squared)
accuracy = model.score(X_test, y_test)
print(accuracy)

from sklearn.metrics import mean_squared_error

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)


#random forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the model
model = RandomForestRegressor(n_estimators=100, max_depth=10)

# Fit the model on training data
model.fit(X_train, y_train)

# Make predictions on test data
y_pred = model.predict(X_test)

# Evaluate the model on the test set (score() returns R-squared)
accuracy = model.score(X_test, y_test)
print("accuracy: ", accuracy)

from sklearn.metrics import mean_squared_error

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)


#XG Booster regressor

In [None]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Create XGBRegressor object with hyperparameters
xgb_model = xgb.XGBRegressor(
    max_depth=3, 
    learning_rate=0.1, 
    n_estimators=1200, 
    objective='reg:squarederror')

# Train the model
xgb_model.fit(X_train, y_train)

# Predict on test data
y_pred = xgb_model.predict(X_test)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("RMSE:", rmse)

# Evaluate the XGBoost model on the test set (score() returns R-squared)
accuracy = xgb_model.score(X_test, y_test)
print("accuracy: ", accuracy)
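
The model cells above repeat the same fit, predict and score steps. A hedged sketch of a shared helper, with the hypothetical name `evaluate` and synthetic data standing in for the real features, could look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model and return (RMSE, R-squared) on the held-out data."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    return rmse, float(r2_score(y_test, y_pred))

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=0),
}
results = {
    name: evaluate(m, X_demo[:240], y_demo[:240], X_demo[240:], y_demo[240:])
    for name, m in models.items()
}
```

Passing the model in as an argument also avoids scoring the wrong object when several models share a notebook.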


In [None]:
# Import libraries
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Embedding, LSTM, Bidirectional
from tensorflow.keras.models import Sequential
from sklearn.model_selection import train_test_split

# Define the sequence length and number of features
sequence_length = 100
num_features = 10000

# Tokenize the text
tokenizer = Tokenizer(num_words=num_features, oov_token="<OOV>")
tokenizer.fit_on_texts(combined_text)
word_index = tokenizer.word_index

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(combined_text)
padded_sequences = pad_sequences(sequences, maxlen=sequence_length, truncating='post')

# Create an RNN model
model = Sequential()
model.add(Embedding(input_dim=num_features, output_dim=64, input_length=sequence_length))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(1, activation='linear'))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

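# Split the padded sequences into training and validation sets
# (assumes train_df still has one row per entry of combined_text)
X_train, X_test, y_train, y_test = train_test_split(
    padded_sequences, train_df['PRODUCT_LENGTH'].values, test_size=0.2, random_state=42)

# Train the model
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=128)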
