Text-Clustering

OBJECTIVE Implement the DBSCAN clustering algorithm. Deal with text data (news records) in document-term sparse matrix format. Think about best metrics for evaluating clustering solutions. Design a proximity function for text data. Think about the Curse of Dimensionality.

INTRODUCTION Without using existing clustering libraries, we have to implement the DBSCAN clustering algorithm from scratch. The data provided is in the form of a sparse matrix. Another important task is to find an optimal value of eps (the distance threshold) for the DBSCAN algorithm. The input data (provided as training data) consists of 8580 text records in sparse format. No labels are provided.

SCOPE

• Find a way to deal with sparse data.

• Create a CSR matrix.

• Find a way to reduce dimensionality.

• Find a way to determine an effective eps value.

• Implement DBSCAN.

METHODOLOGY

• SPLIT DATA AND CONVERT IT TO CSR MATRIX

  First, I split the data into two lists, valslist1 and valslist2. Then I created a CSR matrix using the approach from Activity 2 (k-NN). The next step is to normalize the matrix to a standard scale. The matrix has dimensions of approximately 8650 × 125680.
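  A minimal sketch of this step is shown below. The input file name, the "term-id count" record layout, and the use of L2 row normalization are assumptions for illustration, not details taken from the original code.

```python
# Sketch: build a CSR matrix from a sparse document-term file and normalize it.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

rows, cols, vals = [], [], []
with open("train.dat") as f:                      # hypothetical file name
    for row_id, line in enumerate(f):
        tokens = line.split()
        # assume alternating term-id / count pairs per document
        for term_id, count in zip(tokens[0::2], tokens[1::2]):
            rows.append(row_id)
            cols.append(int(term_id))
            vals.append(float(count))

mat = csr_matrix((vals, (rows, cols)))            # documents x terms
mat = normalize(mat, norm="l2", axis=1)           # scale each row to unit length
print(mat.shape)
```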

• DIMENSIONALITY REDUCTION The next step is to reduce the dimensionality of the data, since the matrix is quite large and I was facing a lot of issues while working with it. I used random projection and TruncatedSVD for this, as PCA does not work on sparse data. Between random projection and TruncatedSVD, TruncatedSVD performs better; the number of components was selected by trial and error.

    from sklearn.decomposition import TruncatedSVD
    tsvd = TruncatedSVD(n_components=150)
    X_sparse_tsvd = tsvd.fit(mat2).transform(mat2)
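  For comparison, a sketch of the random-projection alternative mentioned above is given here. The component count and the use of the `mat` matrix from the earlier sketch are illustrative assumptions, not the exact values from the original experiments.

```python
# Sketch: compare TruncatedSVD against sparse random projection on the CSR matrix.
from sklearn.decomposition import TruncatedSVD
from sklearn.random_projection import SparseRandomProjection

tsvd = TruncatedSVD(n_components=150)
X_tsvd = tsvd.fit_transform(mat)        # works directly on scipy sparse matrices
print("explained variance:", tsvd.explained_variance_ratio_.sum())

srp = SparseRandomProjection(n_components=150, random_state=0)
X_srp = srp.fit_transform(mat)          # random projection preserves distances approximately
```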

• FINDING EPS (threshold) FOR DBSCAN I used the nearest-neighbour algorithm to find the k nearest neighbours of each point and sorted the resulting distances in descending order. The value of k was chosen somewhat arbitrarily; I tried the elbow method for finding an optimal k, but that did not work. The plot of the sorted distances (shown below) indicates that a value of approximately 0.33 is suitable.
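  A sketch of this k-distance plot is shown below. Using k = 7 (mirroring the MinPts value mentioned later) and the reduced matrix `X_tsvd` from the previous sketch are assumptions.

```python
# Sketch: k-distance plot used to eyeball a suitable eps value.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 7
nn = NearestNeighbors(n_neighbors=k).fit(X_tsvd)
distances, _ = nn.kneighbors(X_tsvd)

# distance of each point to its k-th nearest neighbour, sorted in descending order
k_dist = np.sort(distances[:, -1])[::-1]
plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbour")
plt.show()
```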

• FINDING DISTANCE BETWEEN DIFFERENT POINTS I used Euclidean distance to calculate the distance between points and plotted a graph which shows that the data is densely clustered. I also use this proximity measure to compute the distance between any two points when building clusters in DBSCAN.
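  A minimal version of such a proximity function is sketched below; it assumes the dense, reduced matrix `X_tsvd` produced earlier.

```python
# Sketch: Euclidean proximity function between two dense row vectors.
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two row vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

d = euclidean(X_tsvd[0], X_tsvd[1])
```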

• IMPLEMENTING DBSCAN The most important part of this assignment is to implement DBSCAN from scratch, which is a multi-step process. For MinPts I reused the k value from the eps-estimation step and finally settled on 7; eps, as mentioned above, is approximately 0.33. Any point that has at least MinPts points within its eps radius is considered a core point; a point that fails this condition and is not reachable from any core point is labelled as noise (-1). I tried several different approaches to implementing DBSCAN.
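  The following is a compact from-scratch DBSCAN sketch consistent with the description above (eps ≈ 0.33, MinPts = 7). It is an illustrative reconstruction, not the original code; the input `X_tsvd` and the Euclidean region query are assumptions carried over from the earlier sketches.

```python
# Sketch: DBSCAN from scratch. Labels: -1 = noise, 0..n = cluster ids.
import numpy as np

def region_query(X, idx, eps):
    # indices of all points within eps of point idx (Euclidean distance)
    dists = np.sqrt(((X - X[idx]) ** 2).sum(axis=1))
    return np.where(dists <= eps)[0]

def dbscan(X, eps=0.33, min_pts=7):
    n = X.shape[0]
    labels = np.full(n, -1)              # start with everything marked as noise
    visited = np.zeros(n, dtype=bool)
    cluster_id = -1
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:
            continue                     # not a core point; stays noise unless claimed later
        cluster_id += 1                  # start a new cluster from this core point
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(X, j, eps)
                if len(j_neighbors) >= min_pts:   # j is also a core point, expand further
                    seeds.extend(j_neighbors)
            if labels[j] == -1:          # claim border/unlabelled points for this cluster
                labels[j] = cluster_id
    return labels

labels = dbscan(X_tsvd, eps=0.33, min_pts=7)
```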
