## End-to-End Azure NLP Project: Detect AI-Generated Text

### Clean Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
%pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [3]:
from datasets import load_dataset
# from huggingface_hub import list_datasets
# print(len([dataset.id for dataset in list_datasets()]))

LLM_gen_dataset = load_dataset("perlthoughts/big-brain-4k")
LLM_train_set = LLM_gen_dataset['train']



In [4]:
human_gen_dataset = load_dataset("qwedsacf/ivypanda-essays")
human_gen_dataset_train = human_gen_dataset['train']

#### Convert the Data Into Pandas Dataframes

In [5]:
df_human = pd.DataFrame(human_gen_dataset_train)
df_AI = pd.DataFrame(LLM_train_set)

#### Dropping The Unnecessary Columns

In [6]:
df_human_pcs = df_human.drop(['SOURCE','__index_level_0__'],axis=1)
df_AI_pcs = df_AI.drop(['system','prompt'],axis=1)

#### Removing The Empty Strings  

In [7]:
def rplc_emptystr_w_nan(df_data_pcs):
    """Replace empty or whitespace-only strings with NaN and return the affected rows."""
    # DataFrame.map applies the function element-wise (pandas 2.1+; older versions use applymap)
    df_trns_nan = df_data_pcs.map(lambda x: np.nan if isinstance(x, str) and x.strip() == '' else x)
    # Boolean Series: True where any value in the row is NaN
    bool_series = df_trns_nan.isna().any(axis=1)
    # Keep only the rows that contain at least one NaN
    rows_with_nan = df_trns_nan[bool_series]
    return rows_with_nan

In [8]:
rplc_emptystr_w_nan(df_human_pcs)
# No empty strings in the human-written data

Unnamed: 0,TEXT


In [9]:
# Sample DataFrame
data = {'Column1': [1, 2, np.nan, 4],
        'Column2': [np.nan, 2, 3, 4],
        'Column3': [1, 0.2, 3, 4]}
df = pd.DataFrame(data)
print(df.head()) 
rplc_emptystr_w_nan(df)

   Column1  Column2  Column3
0      1.0      NaN      1.0
1      2.0      2.0      0.2
2      NaN      3.0      3.0
3      4.0      4.0      4.0


Unnamed: 0,Column1,Column2,Column3
0,1.0,,1.0
2,,3.0,3.0
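
The sample frame above contains only NaN values, so it does not exercise the empty-string branch of the helper. The following is a minimal sketch with made-up strings; the names `blank_to_nan`, `toy`, and `flagged` are illustrative and not part of the notebook. It uses column-wise `Series.map`, which should behave the same as `DataFrame.map` but also runs on pandas versions older than 2.1.

```python
import numpy as np
import pandas as pd

def blank_to_nan(value):
    # Hypothetical helper: map empty or whitespace-only strings to NaN.
    return np.nan if isinstance(value, str) and value.strip() == "" else value

# Made-up data: rows 1 and 2 hold an empty and a whitespace-only string.
toy = pd.DataFrame({"text": ["hello", "", "   ", "world"], "n": [1, 2, 3, 4]})

# Apply the helper to every cell, one column at a time.
toy_nan = toy.apply(lambda col: col.map(blank_to_nan))

# Rows where at least one cell became NaN.
flagged = toy_nan[toy_nan.isna().any(axis=1)]
print(flagged.index.tolist())
```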


In [10]:
df_AI_pcs

Unnamed: 0,output
0,The review is neutral. The reviewer did not ha...
1,"Okay, let's solve this math problem together! ..."
2,"As an AI, I understand you are asking for a tw..."
3,The sentence is acceptable. It means that the ...
4,The article does not provide the last name of ...
...,...
249995,"First, we find the prime factorization of each..."
249996,The prime numbers in the list are 23 and 29.\n...
249997,The students are advised to eat normal-sized m...
249998,"Jean thought ""David"" was special because he ma..."


In [11]:
df_AI.head()

Unnamed: 0,system,prompt,output
0,You are an AI assistant. Provide a detailed an...,Title: I did not get to see it because I could...,The review is neutral. The reviewer did not ha...
1,"You are a helpful assistant, who always provid...",Solve this math problem\n\nSolve -20*l + 41*l ...,"Okay, let's solve this math problem together! ..."
2,You are an AI assistant. You will be given a t...,Sentiment possibilities Possible answers: 1). ...,"As an AI, I understand you are asking for a tw..."
3,"You are a helpful assistant, who always provid...",Multi-choice problem: Is the next sentence syn...,The sentence is acceptable. It means that the ...
4,You are an AI assistant that follows instructi...,I have a test where I am given the following a...,The article does not provide the last name of ...


In [12]:
print(df_AI["prompt"][15637])

The following article contains an answer for the question: Who steals supplies from other trucks? , can you please find it?   Cooper and Durazno knock out a truck driver and steal his rig. They take it back to a shop where it is repainted and the numbers are filed. In it they find a truckload of carburetors. Cooper abandons Durazno at a gas station and sets out as an independent driver of the yellow Peterbilt. He picks up a hitchhiker but refuses to also give a ride to the man's accompanying woman and dog. At a diner the two notice the Duke of Interstate 40 (Hector Elizondo) eating at another table. Cooper asks him about his rig, which annoys the Duke. Cooper and the hitchhiker watch Samson and Delilah at a drive-in as Cooper discusses professions he's considered as a means to make money and how he reads the almanac so that he can be learning and earning money at the same time. Cooper visits a shopkeeper and attempts to earn money by either selling some of the stolen carburetors or hus

In [13]:
rplc_emptystr_w_nan(df_AI_pcs)

Unnamed: 0,output
4408,
15637,
31616,
33376,
51534,
57974,
87873,
107100,
108134,
123018,


In [14]:
def rplc_emptystr_w_nan_v2(df_data_pcs):
    """Like rplc_emptystr_w_nan, but also return the full frame with NaN substituted."""
    df_trns_nan = df_data_pcs.map(lambda x: np.nan if isinstance(x, str) and x.strip() == '' else x)
    # Boolean Series: True where any value in the row is NaN
    bool_series = df_trns_nan.isna().any(axis=1)
    # Keep only the rows that contain at least one NaN
    rows_with_nan = df_trns_nan[bool_series]
    return df_trns_nan, rows_with_nan


In [15]:
df_AI_wnan, df_rows_nan = rplc_emptystr_w_nan_v2(df_AI_pcs)
df_AI_pcs = df_AI_wnan.dropna().reset_index(drop=True) 

In [16]:
df_AI_pcs

Unnamed: 0,output
0,The review is neutral. The reviewer did not ha...
1,"Okay, let's solve this math problem together! ..."
2,"As an AI, I understand you are asking for a tw..."
3,The sentence is acceptable. It means that the ...
4,The article does not provide the last name of ...
...,...
249970,"First, we find the prime factorization of each..."
249971,The prime numbers in the list are 23 and 29.\n...
249972,The students are advised to eat normal-sized m...
249973,"Jean thought ""David"" was special because he ma..."
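
The replace-then-drop sequence above can also be written as one chained expression. This is only a sketch on made-up data, assuming a regex-based `DataFrame.replace` is an acceptable substitute for the element-wise lambda; the `toy` and `cleaned` names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df_AI_pcs with two blank entries.
toy = pd.DataFrame({"output": ["first answer", "", "  \n", "last answer"]})

cleaned = (
    toy.replace(r"^\s*$", np.nan, regex=True)  # empty / whitespace-only strings -> NaN
       .dropna()                               # drop the rows that are now NaN
       .reset_index(drop=True)                 # renumber rows from 0
)
print(cleaned)
```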


#### Remove Duplicates

In [17]:
df_AI_pcs[df_AI_pcs.duplicated()]
# df_AI_pcs["output"][249929]

Unnamed: 0,output
214,No
313,No
450,The review is positive.
542,No.
587,No
...,...
249900,Educational institution.
249902,The writer's purpose of writing the passage is...
249929,"An example of a tweet is: ""Just finished a gre..."
249944,"Yes, this product review is negative."


In [18]:
#Resetting the indices after removing the duplicates.
df_AI_pcs = df_AI_pcs.drop_duplicates().reset_index(drop=True)

In [19]:
df_human_pcs[df_human_pcs.duplicated()]

Unnamed: 0,TEXT
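
As a small illustration of how `duplicated()` and `drop_duplicates()` behave, here is a sketch on made-up answers (the `toy` frame is hypothetical). Matching is exact, so "No" and "No." count as different rows, which is consistent with both appearing in the duplicate listing for the AI data above.

```python
import pandas as pd

toy = pd.DataFrame({"output": ["No", "Yes", "No", "No.", "Yes"]})

# duplicated() flags repeats of an earlier row; the first occurrence is not flagged.
dup_mask = toy.duplicated()

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Share of rows that are repeats, a quick gauge of how much data is dropped.
dup_share = dup_mask.mean()
print(dup_mask.tolist(), deduped["output"].tolist(), dup_share)
```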


#### Remove URLs

In [20]:
# Check whether the text contains a URL (source: ChatGPT)
import re
def contains_url(text: str) -> bool:
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return bool(url_pattern.search(text))

data = {
    'text': [
        "Check out this link: https://www.example.com",
        "No URL here!",
        "Another link: http://example.org",
        "This text contains a URL: www.test.com"
    ]
}

df = pd.DataFrame(data)
df['has_url'] = df['text'].apply(contains_url)
df.head()

Unnamed: 0,text,has_url
0,Check out this link: https://www.example.com,True
1,No URL here!,False
2,Another link: http://example.org,True
3,This text contains a URL: www.test.com,True


In [21]:
df_AI_dummy = pd.DataFrame()
df_AI_dummy["has_url"] = df_AI_pcs["output"].apply(lambda x: contains_url(x))

In [22]:
df_AI_dummy[df_AI_dummy["has_url"]]
# The AI-generated text does contain URLs.

Unnamed: 0,has_url
184,True
350,True
467,True
1278,True
1376,True
...,...
230728,True
230778,True
231572,True
232245,True


In [23]:
print(df_AI_pcs["output"][184])

As an AI language model, I can't provide a complete list of all tutorials available for LISP programming, but I can help get you started with some essential resources and a brief overview of the language.

LISP (short for List Processing) is one of the oldest high-level programming languages, primarily used for artificial intelligence (AI) and symbolic computing. The language is characterized by its parenthetical syntax and support for powerful data manipulation using lists.

Here are some resources to help you learn LISP:

1. **Official documentation**: The "Common Lisp Hyperspec" is a comprehensive online reference to the Common Lisp language. Start here for definitive information on language features, syntax, and standard library: http://www.lispworks.com/documentation/HyperSpec/Front/index.htm

2. **Books**:
   - "Practical Common Lisp" by Peter Seibel: This book is suitable for beginners and provides practical examples of LISP programming. The full text is available online: http:/

In [24]:
import re
example="New competition launched :https://www.kaggle.com/c/nlp-getting-started"
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

remove_URL(example)

'New competition launched :'
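
The same pattern can be applied with the vectorized pandas string methods instead of `apply`. The sketch below uses made-up strings and an illustrative `URL_PATTERN` constant. It also shows one property of the pattern worth knowing: because `\S+` consumes every non-space character, punctuation that directly follows a URL is removed along with it.

```python
import pandas as pd

# Same regular expression as in remove_URL, kept as a plain string.
URL_PATTERN = r"https?://\S+|www\.\S+"

texts = pd.Series([
    "Docs: https://www.example.com/page",
    "No URL here!",
    "(see http://example.org).",
])

has_url = texts.str.contains(URL_PATTERN, regex=True)
stripped = texts.str.replace(URL_PATTERN, "", regex=True)

# The closing ")." in the third string is swallowed together with the URL.
print(stripped.tolist())
```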

In [25]:
df_AI_pcs["output"] = df_AI_pcs["output"].apply(lambda x : remove_URL(x))
print(df_AI_pcs)
print(df_AI_pcs["output"][184])

                                                   output
0       The review is neutral. The reviewer did not ha...
1       Okay, let's solve this math problem together! ...
2       As an AI, I understand you are asking for a tw...
3       The sentence is acceptable. It means that the ...
4       The article does not provide the last name of ...
...                                                   ...
233167  We can convert $\frac{5}{14}$ into a decimal b...
233168  First, we find the prime factorization of each...
233169  The students are advised to eat normal-sized m...
233170  Jean thought "David" was special because he ma...
233171                                          homology.

[233172 rows x 1 columns]
As an AI language model, I can't provide a complete list of all tutorials available for LISP programming, but I can help get you started with some essential resources and a brief overview of the language.

LISP (short for List Processing) is one of the oldest high-level progra

In [26]:
df_human_dummy = pd.DataFrame()
df_human_dummy["has_url"] = df_human_pcs["TEXT"].apply(lambda x: contains_url(x))
df_human_dummy[df_human_dummy["has_url"] == True]


Unnamed: 0,has_url
31,True
291,True
341,True
737,True
1460,True
...,...
128235,True
128236,True
128239,True
128279,True


In [27]:
print(df_human_pcs["TEXT"][31])

Accounting Basics and How to Remember Them Essay

The rhythmic accounting rap song Debit Credit Theory by Colin Dodds (n.d.) is an excellent way to remember that debit’s location is the left side of the account, and credit’s location is the right one. This unusual song also explains the meaning of these two terms. This fun but an educational source has enabled me to remember the material and not get confused in the concepts.

A link: https://www.youtube.com/watch?v=j71Kmxv7smk

The mnemonic offered by Heather McNellis (2020) also contributes to better understanding and remembering the essence of debit and credit and the account balances. The DEAL/CLIP mnemonic contains the explanation of debit’s and credit’s parts. Due to this engaging technic, it is pretty easy to understand that while debit includes Drawings, Expenses, Assets, and Losses, credit consists of Capital, Liabilities, Income, as well as Profits.

A link: https://www.icas.com/students/learning-blog/test-of-competence/financ

In [28]:
df_human_pcs["TEXT"] = df_human_pcs["TEXT"].apply(lambda x : remove_URL(x))
print(df_human_pcs["TEXT"][31])

Accounting Basics and How to Remember Them Essay

The rhythmic accounting rap song Debit Credit Theory by Colin Dodds (n.d.) is an excellent way to remember that debit’s location is the left side of the account, and credit’s location is the right one. This unusual song also explains the meaning of these two terms. This fun but an educational source has enabled me to remember the material and not get confused in the concepts.

A link: 

The mnemonic offered by Heather McNellis (2020) also contributes to better understanding and remembering the essence of debit and credit and the account balances. The DEAL/CLIP mnemonic contains the explanation of debit’s and credit’s parts. Due to this engaging technic, it is pretty easy to understand that while debit includes Drawings, Expenses, Assets, and Losses, credit consists of Capital, Liabilities, Income, as well as Profits.

A link: 

I have chosen these two sources since they make it easy to remember accounting basics. They are pretty bright,

#### Remove Emojis

In [29]:
# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; note this wide range also covers CJK and other non-emoji characters
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("Omg another Earthquake 😔😔")

'Omg another Earthquake '
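
One caveat of the pattern above is that its last range, U+24C2 to U+1F251, spans far more than emoji, including CJK ideographs. The sketch below illustrates this on a made-up sample and shows a hypothetical narrower variant (`remove_emoji_narrow` is not part of the notebook) that simply omits that range. Whether the narrower set is sufficient depends on which symbols actually occur in the data.

```python
import re

def remove_emoji_narrow(text: str) -> str:
    """Hypothetical variant that omits the wide U+24C2..U+1F251 range."""
    pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
        "\U00002702-\U000027B0"  # dingbats
        "]+"
    )
    return pattern.sub("", text)

# The wide range on its own, to show what it matches.
wide_range = re.compile("[\U000024C2-\U0001F251]+")

sample = "日本語 text 😔"
wide_result = wide_range.sub("", sample)      # removes the Japanese characters
narrow_result = remove_emoji_narrow(sample)   # removes only the emoji
print(repr(wide_result), repr(narrow_result))
```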

In [30]:
df_AI_pcs["output"] = df_AI_pcs["output"].apply(lambda x : remove_emoji(x))
df_human_pcs["TEXT"] = df_human_pcs["TEXT"].apply(lambda x : remove_emoji(x))

#### Remove HTML Tags

In [31]:
# Detect HTML tags (removal follows in a later cell)
def contains_html(text):
    html=re.compile(r'<.*?>')
    return bool(html.search(text))

In [32]:
df_human_dummy = pd.DataFrame()
df_human_dummy["has_html"] = df_human_pcs["TEXT"].apply(lambda x: contains_html(x))
df_human_dummy[df_human_dummy["has_html"] == True]

Unnamed: 0,has_html
373,True
709,True
1478,True
1630,True
1693,True
...,...
127927,True
127934,True
127936,True
127937,True


In [33]:
print(df_human_pcs["TEXT"][373])

Hadoop Platform: Usage and Significance Research Paper

Hadoop and its Operations

Hadoop is an open-source platform commonly used for the storage and processing of big data. It is an approach that enhances the efficiency scale in analyzing massive datasets across different connected computers. Although a single computer can process big data based on schematics, networking improves productivity in processing information from dynamic sectors. Hadoop is a technological tool essential in the modern world due to the increasing gathering and processing of big data (Data Flair, 2022). Different institutions use the dataset for dynamic purposes, such as analyzing the behavioral indicator during the purchasing process. Therefore, it is the responsibility of executive teams to establish critical frameworks that enhance the derivation of in-depth details concerning the distinctive variables. The lack of clarity fosters subjective decision-making that results in bias due to inadequate knowledge o

In [34]:
df_AI_dummy = pd.DataFrame()
df_AI_dummy["has_html"] = df_AI_pcs["output"].apply(lambda x: contains_html(x))
df_AI_dummy[df_AI_dummy["has_html"] == True]

Unnamed: 0,has_html
23,True
86,True
427,True
457,True
744,True
...,...
232342,True
232910,True
233030,True
233107,True


In [35]:
example = """<div>
<h1>Real or Fake</h1>
<p>Kaggle </p>
<a href="https://www.kaggle.com/c/nlp-getting-started">getting started</a>
</div>"""
def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)
print(remove_html(example))


Real or Fake
Kaggle 
getting started
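
The pattern `<.*?>` matches any text between a `<` and the next `>`, so it can also delete non-HTML content such as inequalities. The sketch below compares it with a stricter pattern that requires a letter right after `<` or `</`. `STRICT_TAG` is an assumed alternative, not something the notebook uses, and it is not a full HTML parser.

```python
import re

LOOSE_TAG = re.compile(r"<.*?>")                # pattern used in the notebook
STRICT_TAG = re.compile(r"</?[A-Za-z][^<>]*>")  # assumed stricter alternative

math_text = "if a < b and c > d"
html_text = '<p>Kaggle <a href="https://example.com">link</a></p>'

loose_math = LOOSE_TAG.sub("", math_text)    # the "< b and c >" span is deleted
strict_math = STRICT_TAG.sub("", math_text)  # left unchanged
strict_html = STRICT_TAG.sub("", html_text)  # real tags are still removed
print(repr(loose_math), repr(strict_math), repr(strict_html))
```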



In [36]:
df_AI_pcs["output"] = df_AI_pcs["output"].apply(lambda x : remove_html(x))
df_human_pcs["TEXT"] = df_human_pcs["TEXT"].apply(lambda x : remove_html(x))

In [37]:
#print(df_AI_pcs["output"][23])
print(df_human_pcs["TEXT"][373])

Hadoop Platform: Usage and Significance Research Paper

Hadoop and its Operations

Hadoop is an open-source platform commonly used for the storage and processing of big data. It is an approach that enhances the efficiency scale in analyzing massive datasets across different connected computers. Although a single computer can process big data based on schematics, networking improves productivity in processing information from dynamic sectors. Hadoop is a technological tool essential in the modern world due to the increasing gathering and processing of big data (Data Flair, 2022). Different institutions use the dataset for dynamic purposes, such as analyzing the behavioral indicator during the purchasing process. Therefore, it is the responsibility of executive teams to establish critical frameworks that enhance the derivation of in-depth details concerning the distinctive variables. The lack of clarity fosters subjective decision-making that results in bias due to inadequate knowledge o

#### Remove Punctuation

In [38]:
import string
def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

example="I am a #king, she is a #queen."
print(remove_punct(example))

I am a king she is a queen
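
`string.punctuation` contains only ASCII characters, so typographic marks such as the curly apostrophes seen in the essay text (for example "debit’s") are left in place. The following sketch shows a hypothetical variant, `remove_punct_unicode`, that additionally drops any character whose Unicode category starts with "P". It is slower than `str.translate` and is meant only to illustrate the difference.

```python
import string
import unicodedata

def remove_punct_unicode(text: str) -> str:
    """Hypothetical variant: drop ASCII punctuation plus Unicode 'P*' characters."""
    return "".join(
        ch for ch in text
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith("P")
    )

# Translation table equivalent to the one built inside remove_punct.
ascii_table = str.maketrans("", "", string.punctuation)

sample = "debit’s “left” side, #1!"
ascii_only = sample.translate(ascii_table)   # curly quotes survive
unicode_aware = remove_punct_unicode(sample)
print(ascii_only)
print(unicode_aware)
```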


In [39]:
df_AI_pcs["output"]

0         The review is neutral. The reviewer did not ha...
1         Okay, let's solve this math problem together! ...
2         As an AI, I understand you are asking for a tw...
3         The sentence is acceptable. It means that the ...
4         The article does not provide the last name of ...
                                ...                        
233167    We can convert $\frac{5}{14}$ into a decimal b...
233168    First, we find the prime factorization of each...
233169    The students are advised to eat normal-sized m...
233170    Jean thought "David" was special because he ma...
233171                                            homology.
Name: output, Length: 233172, dtype: object

In [40]:
df_AI_pcs["output"] = df_AI_pcs["output"].apply(lambda x : remove_punct(x))
df_human_pcs["TEXT"] = df_human_pcs["TEXT"].apply(lambda x : remove_punct(x))

In [41]:
df_AI_pcs["output"]

0         The review is neutral The reviewer did not hav...
1         Okay lets solve this math problem together \n\...
2         As an AI I understand you are asking for a twe...
3         The sentence is acceptable It means that the s...
4         The article does not provide the last name of ...
                                ...                        
233167    We can convert frac514 into a decimal by long ...
233168    First we find the prime factorization of each ...
233169    The students are advised to eat normalsized me...
233170    Jean thought David was special because he made...
233171                                             homology
Name: output, Length: 233172, dtype: object

In [46]:
print("length of df_AI_pcs:", len(df_AI_pcs.index))
print("length of df_human_pcs:", len(df_human_pcs.index))

length of df_AI_pcs: 233172
length of df_human_pcs: 128293


In [48]:
# Truncate the AI data to the size of the human data so the classes are balanced;
# .copy() makes the results independent frames, so adding a column below is safe.
df_AI_clean = df_AI_pcs.iloc[:len(df_human_pcs.index), :].copy()
df_human_clean = df_human_pcs.copy()

In [51]:
df_AI_clean["label"] = 1
df_human_clean["label"] = 0


In [53]:
df_AI_clean.head()
df_human_clean.head()

Unnamed: 0,TEXT,label
0,12 Years a Slave An Analysis of the Film Essay...,0
1,20 Social Media Post Ideas to Radically Simpli...,0
2,2022 Russian Invasion of Ukraine in Global Med...,0
3,533 US 27 2001 Kyllo v United States The Use o...,0
4,A Charles Schwab Corporation Case Essay\n\nCha...,0
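
For a later modelling step, the two labelled frames usually need to end up in one table with a shared text column. The sketch below shows one possible way to do that on made-up stand-ins. The column name `text`, the random sampling, and the seed are assumptions and not choices made in the notebook, which takes the first rows of the AI data instead.

```python
import pandas as pd

# Made-up stand-ins for df_AI_pcs and df_human_pcs.
ai = pd.DataFrame({"output": [f"ai text {i}" for i in range(5)]})
human = pd.DataFrame({"TEXT": [f"human text {i}" for i in range(3)]})

# Balance the classes on the size of the smaller frame.
n = min(len(ai), len(human))

# Random sampling avoids any ordering effect in the source data;
# random_state is an arbitrary seed for reproducibility.
ai_bal = (ai.sample(n=n, random_state=42)
            .rename(columns={"output": "text"})
            .assign(label=1))
human_bal = (human.sample(n=n, random_state=42)
                  .rename(columns={"TEXT": "text"})
                  .assign(label=0))

combined = pd.concat([ai_bal, human_bal], ignore_index=True)
print(combined["label"].value_counts())
```

Because `assign` returns a new frame, this form does not modify a slice in place.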


In [54]:
# df_AI_clean.to_csv("../Clean_data/ai_gen_text_v2.csv", index = False)
# df_human_clean.to_csv("../Clean_data/human_wrttn_text_v2.csv", index = False)

#### Spelling Check

In [41]:
# !pip install pyspellchecker

In [42]:
# from spellchecker import SpellChecker

# spell = SpellChecker()
# def correct_spellings(text):
#     corrected_text = []
#     misspelled_words = spell.unknown(text.split())
#     for word in text.split():
#         if word in misspelled_words:
#             # correction() can return None when no candidate is found, so fall back to the word
#             corrected_text.append(spell.correction(word) or word)
#         else:
#             corrected_text.append(word)
#     return " ".join(corrected_text)
        
# text = "corect me plese"
# correct_spellings(text)

In [43]:
# df['text'] = df['text'].apply(correct_spellings)