# Web Mining and Applied NLP (44-620)

## Final Project

### Student Name: Jim Crivello  https://github.com/jmcriv/FinalProject

Hello!

Aside from my full-time work as an auditor, I am a part-time teacher and a part-time writer. Plagiarism is a big concern in both the teaching and writing domains. I use online tools such as Turnitin to compare my work or a student’s work to multiple online sources. My goal with this project is to write a program that compares documents for potential plagiarism issues.

When I use Turnitin to check for similarities, the program compares one paper against documents it finds on the internet. The problem is that classmates turn in their work at the same time, so Turnitin usually does not yet have the other students' work in its database for comparison. The target of my project is a comparison of multiple student papers not only against the internet, but also against each other.

To stand in for student papers, I will extend our earlier API calls for music lyrics. I will download the lyrics and then create multiple variations of the JSON files, copied to text files, to simulate student papers.

#### ---------------------------------------------

Perform the tasks described in the Markdown cells below. When you have completed the assignment, make sure your code cells have all been run (and have output beneath them) and ensure you have committed and pushed ALL of your changes to your assignment repository.

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a python file (`.py`), then import and run the appropriate code to answer the question.



### The code below is from the Module Four JSON assignment

In [15]:
import lyricsgenius
import json
import os
from dotenv import load_dotenv

load_dotenv()

def get_genius_access_token():
    return os.getenv("GENIUS_ACCESS_TOKEN")

def get_song_lyrics(artist, song, filename):
    # Read the token from the environment (.env) instead of hardcoding it
    access_token = get_genius_access_token()
    genius = lyricsgenius.Genius(access_token)
    try:
        # Use a separate name so the requested title is still available below
        result = genius.search_song(song, artist)
        if result:
            lyrics = result.lyrics
            with open(filename, 'w') as file:
                json.dump(lyrics, file)
            print(f"Lyrics for '{result.title}' by {result.artist} have been saved to {filename}")
        else:
            print(f"Failed to retrieve lyrics for '{song}' by {artist}")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

get_song_lyrics("Steely Dan", "Pretzel Logic", "pretzel_logic.json")
get_song_lyrics("Led Zepplin", "Stairway to Heaven", "stairway_heaven.json")
get_song_lyrics("Eagles", "One of these Nights", "these_nights.json")
get_song_lyrics("Pink Floyd", "Money", "money.json")


Searching for "Pretzel Logic" by Steely Dan...
Done.
Lyrics for 'Pretzel Logic' by Steely Dan have been saved to pretzel_logic.json
Searching for "Stairway to Heaven" by Led Zepplin...
Done.
Lyrics for 'Stairway to Heaven' by Led Zeppelin have been saved to stairway_heaven.json
Searching for "One of these Nights" by Eagles...
Done.
Lyrics for 'One of These Nights' by Eagles have been saved to these_nights.json
Searching for "Money" by Pink Floyd...
Done.
Lyrics for 'Money' by Pink Floyd have been saved to money.json


### I struggled to convert the JSON objects to text files and will keep practicing. For now, I manually created the text files referenced in the next section.
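One way the conversion might be automated is sketched below. It assumes each `.json` file holds a single JSON string, which is what `json.dump(lyrics, file)` in the cell above writes. The helper name `json_lyrics_to_text` and the choice to reuse the file name with a `.txt` extension are illustrative assumptions, not part of the original assignment.

```python
import json
from pathlib import Path

def json_lyrics_to_text(json_path, txt_path=None):
    """Read a JSON file holding a single string and write it out as plain text."""
    json_path = Path(json_path)
    # Default to the same file name with a .txt extension (an assumption)
    txt_path = Path(txt_path) if txt_path else json_path.with_suffix(".txt")
    with open(json_path, encoding="utf-8") as f:
        lyrics = json.load(f)  # json.dump wrote a str, so json.load returns a str
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(lyrics)
    return txt_path

if __name__ == "__main__":
    # File names taken from the calls in the cell above; missing files are skipped
    for name in ["pretzel_logic.json", "stairway_heaven.json",
                 "these_nights.json", "money.json"]:
        if Path(name).exists():
            print(json_lyrics_to_text(name))
```

Because `json.load` decodes the escaped newlines, the resulting text file should contain the lyrics on separate lines rather than as one quoted string.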

### The following code is based on https://hackernoon.com/how-to-detect-plagiarism-in-text-using-python-zn213tw7

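This code compares each of the five text files against the other four and reports a cosine similarity score for every pair. Scores range from 0 to 1, and a higher score means the two files share more of their wording, so a high score signals possible plagiarism rather than originality.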

I copied portions of each of the four music lyrics text files to paper.txt to simulate a student copying data from all four documents.
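To make the similarity score more concrete, here is a simplified sketch that uses only the standard library. It measures cosine similarity on raw word counts, not on the TF-IDF weights that the project code uses, so its numbers will not match scikit-learn's. The function name `cosine_from_counts` and the sample sentences are made up for illustration.

```python
import math
from collections import Counter

def cosine_from_counts(text_a, text_b):
    """Cosine similarity of two texts using raw word counts (not TF-IDF)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words the two texts have in common
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with anything
    return dot / (norm_a * norm_b)

# Made-up sample sentences, for illustration only
original = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox sleeps near the lazy dog"
unrelated = "stock prices fell sharply this morning"

print(cosine_from_counts(original, original))   # identical text, score near 1
print(cosine_from_counts(original, copied))     # mostly copied, high score
print(cosine_from_counts(original, unrelated))  # no shared words, score 0
```

TF-IDF adds one refinement to this idea: words that appear in most of the documents count for less, so common words contribute less to the score than distinctive ones.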


In [16]:
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Collect every .txt file in the working directory and read its contents
student_files = [doc for doc in os.listdir() if doc.endswith('.txt')]
student_notes = []
for file_name in student_files:
    with open(file_name) as f:
        student_notes.append(f.read())

def vectorize(texts):
    # Turn the documents into TF-IDF vectors, one row per document
    return TfidfVectorizer().fit_transform(texts).toarray()

def similarity(doc1, doc2):
    # Return the 2x2 cosine similarity matrix for two document vectors
    return cosine_similarity([doc1, doc2])

vectors = vectorize(student_notes)
s_vectors = list(zip(student_files, vectors))

def check_plagiarism():
    # Set of (file_a, file_b, score); sorting each pair lets (a, b) and (b, a) collapse into one entry
    plagiarism_results = set()
    for index_a, (student_a, text_vector_a) in enumerate(s_vectors):
        for index_b, (student_b, text_vector_b) in enumerate(s_vectors):
            if index_a == index_b:
                continue  # skip comparing a file with itself
            sim_score = similarity(text_vector_a, text_vector_b)[0][1]
            student_pair = sorted((student_a, student_b))
            plagiarism_results.add((student_pair[0], student_pair[1], sim_score))
    return plagiarism_results

for data in check_plagiarism():
    print(data)

('pretzel_logic.txt', 'stairway_heaven.txt', 0.3177821102128199)
('money.txt', 'pretzel_logic.txt', 0.21068201156000455)
('paper.txt', 'pretzel_logic.txt', 0.5661774583635705)
('money.txt', 'these_nights.txt', 0.15761119701841553)
('money.txt', 'stairway_heaven.txt', 0.2613440903770172)
('paper.txt', 'stairway_heaven.txt', 0.5692006433462361)
('pretzel_logic.txt', 'these_nights.txt', 0.13873148368337798)
('paper.txt', 'these_nights.txt', 0.4398902693353228)
('stairway_heaven.txt', 'these_nights.txt', 0.2322781837981371)
('money.txt', 'paper.txt', 0.4895340966051793)
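
The scores above print in no particular order because they come from a set. A small sketch like the one below could rank the pairs and mark the ones worth a closer look. The scores are copied from the output above. The `REVIEW_THRESHOLD` of 0.4 and the helper names `rank_pairs` and `flag_pairs` are hypothetical choices for illustration, not an established cutoff.

```python
# Scores copied from the output above, one (file_a, file_b, score) tuple per pair
results = {
    ('pretzel_logic.txt', 'stairway_heaven.txt', 0.3177821102128199),
    ('money.txt', 'pretzel_logic.txt', 0.21068201156000455),
    ('paper.txt', 'pretzel_logic.txt', 0.5661774583635705),
    ('money.txt', 'these_nights.txt', 0.15761119701841553),
    ('money.txt', 'stairway_heaven.txt', 0.2613440903770172),
    ('paper.txt', 'stairway_heaven.txt', 0.5692006433462361),
    ('pretzel_logic.txt', 'these_nights.txt', 0.13873148368337798),
    ('paper.txt', 'these_nights.txt', 0.4398902693353228),
    ('stairway_heaven.txt', 'these_nights.txt', 0.2322781837981371),
    ('money.txt', 'paper.txt', 0.4895340966051793),
}

REVIEW_THRESHOLD = 0.4  # hypothetical cutoff, chosen only for this illustration

def rank_pairs(pairs):
    """Return the pairs sorted from most similar to least similar."""
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)

def flag_pairs(pairs, threshold=REVIEW_THRESHOLD):
    """Return the ranked pairs whose score meets or exceeds the threshold."""
    return [pair for pair in rank_pairs(pairs) if pair[2] >= threshold]

for file_a, file_b, score in rank_pairs(results):
    marker = "REVIEW" if score >= REVIEW_THRESHOLD else ""
    print(f"{score:.3f}  {file_a} vs {file_b}  {marker}")
```

With these scores, the four highest-ranked pairs all include paper.txt, which is consistent with that file having been assembled from portions of the other four.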
