## Now that we know a little Python: using AI as a helper

Let's see if we can use [ChatGPT](https://chat.openai.com/) to walk us through a complicated problem.

In April 2023, the Washington Post published [Inside the secret list of websites that make AI like ChatGPT sound smart](https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/). This story analyzed the [C4 dataset](https://huggingface.co/datasets/allenai/c4), a selection of data that's part of the training process for large language models like ChatGPT.

It showed how much content was from Wikipedia, whether business or hobby websites were more popular, and even included a tool that allowed you to search whether your website was included in the dataset.

Let's see if we can do our own analysis! We're going to use a tiny sample of the C4M dataset (more commonly written mC4), which is the multilingual version.

In [None]:
import pandas as pd
pd.options.display.max_colwidth = 400

# We're reading a CSV right from the internet (you can visit the URL if you'd like to see the raw file)
# nrows=3000 keeps only the first 3,000 rows so everything runs quickly
df = pd.read_csv("https://raw.githubusercontent.com/jsoma/2024-birn/main/01-pandas/c4m-tiny-sample.csv", nrows=3000)
df.head(10)
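
The Post's story ranked websites by how much of the dataset came from each one. Here is a minimal sketch of that idea on toy rows rather than the real sample. It assumes a `url` column, which is a guess: check `df.columns` to see what the CSV actually calls it, or whether it's there at all.

```python
from urllib.parse import urlparse

import pandas as pd

# Toy rows standing in for the real sample; the `url` column name is an assumption
toy = pd.DataFrame({
    "url": [
        "https://en.wikipedia.org/wiki/Python",
        "https://en.wikipedia.org/wiki/Pandas",
        "https://example.com/blog/post-1",
    ],
    "text": ["Python is a language", "Pandas is a library", "Hello world"],
})

def get_domain(url):
    # netloc is the host part of a URL, e.g. "en.wikipedia.org"
    return urlparse(url).netloc

toy["domain"] = toy["url"].apply(get_domain)

# Count how many rows came from each domain
domain_counts = toy["domain"].value_counts()
print(domain_counts)
```

If your DataFrame does have a URL column, the same two lines (`apply` then `value_counts`) should work on it directly.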

## Let's get crazy

What language is each one of these in? Let's find out by [seeing what ChatGPT can help us do](https://chatgpt.com/). This will be an exercise in asking specific questions, troubleshooting problems, and having a back-and-forth conversation with AI tools.

In [None]:
%pip install langdetect tqdm

In [None]:
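import pandas as pd
from tqdm import tqdm
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

# langdetect is randomized by default; fixing the seed makes results repeatable
DetectorFactory.seed = 0

# Adds .progress_apply() to pandas so we get a progress bar
tqdm.pandas()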

# Assuming df is your DataFrame and 'text' is the column containing the text data
def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        return 'unknown'  # In case language detection fails

# Add the 'lang' column to the DataFrame
df['lang'] = df['text'].progress_apply(detect_language)

In [None]:
df

In [None]:
from langdetect import detect_langs
from langdetect.lang_detect_exception import LangDetectException

def detect_language_with_confidence(text):
    try:
        detections = detect_langs(text)
        if detections:
            top_detection = detections[0]
            return top_detection.lang, top_detection.prob
        else:
            return 'unknown', 0.0
    except LangDetectException:
        return 'unknown', 0.0  # In case language detection fails

# Apply the function to get both language and confidence score
df[['lang', 'confidence']] = df['text'].progress_apply(lambda x: pd.Series(detect_language_with_confidence(x)))

In [None]:
df

In [None]:
df['lang'].value_counts(normalize=True)
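
A language label is only as useful as the confidence behind it. Below is a small sketch of how you might pull out the shaky detections, using toy rows that mimic the `lang` and `confidence` columns. The 0.9 cutoff is an arbitrary number picked for illustration, not a recommendation from langdetect.

```python
import pandas as pd

# Toy results standing in for the real `lang` and `confidence` columns
results = pd.DataFrame({
    "lang": ["en", "hr", "sr", "unknown"],
    "confidence": [0.99, 0.57, 0.71, 0.0],
})

THRESHOLD = 0.9  # arbitrary cutoff chosen for illustration

# Rows the detector was not very sure about
uncertain = results[results["confidence"] < THRESHOLD]
print(uncertain)

# Share of rows that clear the cutoff
share_confident = (results["confidence"] >= THRESHOLD).mean()
print(share_confident)
```

On the real data, swapping `results` for `df` should show which texts are worth reading by hand.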

In [None]:
%pip install fasttext

In [None]:
import os
import urllib.request

# URL for the lid.176.bin language identification model
url = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"
output_path = "lid.176.bin"

# Download the file, but only if we don't already have it
if not os.path.exists(output_path):
    print("Downloading lid.176.bin...")
    urllib.request.urlretrieve(url, output_path)
    print("Download complete!")

In [None]:
import fasttext

# Load the pre-trained language identification model
model = fasttext.load_model('lid.176.bin')

def detect_language_fasttext(text):
    # Clean the text by removing newlines
    cleaned_text = text.replace('\n', ' ').strip()
    # predict() returns (labels, probabilities); labels look like '__label__en'
    predictions = model.predict(cleaned_text)
    lang = predictions[0][0].replace('__label__', '')
    confidence = predictions[1][0]
    return lang, confidence

df[['lang', 'confidence']] = df['text'].progress_apply(lambda x: pd.Series(detect_language_fasttext(x)))

In [None]:
df['lang'].value_counts(normalize=True)
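
One thing to watch: the fastText cell writes into the same `lang` and `confidence` columns, so the langdetect results are overwritten. If you want to compare the two tools, one option is to keep each result in its own column. The sketch below uses toy rows and made-up column names (`lang_langdetect`, `lang_fasttext`) to show the comparison itself.

```python
import pandas as pd

# Toy rows; the column names are hypothetical, chosen to keep both results side by side
compare = pd.DataFrame({
    "lang_langdetect": ["en", "hr", "sr", "de"],
    "lang_fasttext": ["en", "sr", "sr", "de"],
})

# True where both detectors picked the same language
compare["agree"] = compare["lang_langdetect"] == compare["lang_fasttext"]

# Fraction of rows where the two tools agree
agreement_rate = compare["agree"].mean()
print(agreement_rate)

# The rows worth a closer look
disagreements = compare[~compare["agree"]]
print(disagreements)
```

Disagreements between detectors are often the most interesting rows, especially for closely related languages.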

In [None]:
df[df['lang'].isin(['hr', 'sr'])]

In [None]:
pd.options.display.max_colwidth = 500

In [None]:
df[df['lang'].isin(['hr', 'sr'])]

## Saving the results

In [None]:
df.to_csv("edited.csv", index=False)
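
It can be worth reading the file back in to make sure it saved the way you expected. This is a rough sketch using a toy DataFrame and a temporary folder, so it won't touch your real `edited.csv`.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for the real DataFrame
toy = pd.DataFrame({"text": ["Dobar dan", "Hello"], "lang": ["hr", "en"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edited.csv")

    # index=False keeps the row numbers out of the file
    toy.to_csv(path, index=False)

    # Read it back to check the columns and rows survived
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
print(len(reloaded))
```

Without `index=False`, the row numbers get saved as an extra unnamed column, which shows up as `Unnamed: 0` when you read the file back.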