# Machine Learning Final Project

## Step 0: Project Ideation & Dataset Inspection

### Instructions
1. **Dataset Loading**:

    - For each plan (e.g., Plan A, Plan B, etc.), read in the corresponding dataset.

2. **Inspect the Dataset**:

    - After loading the dataset, display the first few rows to understand its structure.

    - Identify and list the types of fields (e.g., numerical, categorical, text).

    - Attempt to identify the target variable if your project involves supervised learning.
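The inspection steps above can be sketched on a toy DataFrame. This is a minimal illustration rather than a prescribed method: the helper `classify_fields`, its `max_categories` threshold, and the sample columns are all hypothetical, and the numerical/categorical/text split is a rough heuristic based on dtype and the number of distinct values.

```python
import pandas as pd

def classify_fields(frame, max_categories=20):
    """Label each column as 'numerical', 'categorical', or 'text' (heuristic)."""
    kinds = {}
    for name in frame.columns:
        column = frame[name]
        if pd.api.types.is_numeric_dtype(column):
            kinds[name] = 'numerical'
        elif column.nunique(dropna=True) <= max_categories:
            # Few distinct non-numeric values: treat as categorical
            kinds[name] = 'categorical'
        else:
            # Many distinct non-numeric values: treat as free text
            kinds[name] = 'text'
    return kinds

# Hypothetical sample data, only for illustration
sample = pd.DataFrame({
    'score': [6, 4, 4, 0],
    'language': ['es', 'es', 'en', 'es'],
    'doc_text': ['first document', 'second document',
                 'third document', 'fourth document'],
})

print(sample.head())            # first few rows
field_kinds = classify_fields(sample, max_categories=3)
print(field_kinds)
```

A numeric column such as `score` is often a natural candidate for the target variable in a supervised setup, but that choice depends on the project question and should be confirmed against the dataset documentation.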

### Plan A: Animal Drug Use Adverse Events

### Plan B: Biomedical Image Analysis

In [None]:
# Your code here

### Plan C: Website Fact Checker

In [1]:
import pandas as pd

# Load the raw JSON Lines file to see what's inside.
file_path = '/workspaces/miami-ds-5-final-project/data/raw/en.es.train.jl'

with open(file_path, 'r') as file:
    content = file.readlines()

# Wrap the raw lines in a single-column DataFrame; each row is one unparsed JSON record.
data = pd.DataFrame(content)

# As the last expression in the cell, this displays a truncated view of the DataFrame.
data

      0
0     {"src_id": "8221", "src_query": "Death", "tgt_...
1     {"src_id": "5334607", "src_query": "Africa", "...
2     {"src_id": "26769", "src_query": "South Americ...
3     {"src_id": "21241", "src_query": "Norway", "tg...
4     {"src_id": "26994", "src_query": "Scotland", "...
...   ...
9995  {"src_id": "357953", "src_query": "Red-rumped ...
9996  {"src_id": "6583", "src_query": "Chinese cuisi...
9997  {"src_id": "352100", "src_query": "Broad-bille...
9998  {"src_id": "225615", "src_query": "Terminator ...


In [6]:
import pandas as pd
import json
import os

# Path to the JSON Lines training file
file_path = '/workspaces/miami-ds-5-final-project/data/raw/en.es.train.jl'

# Check file size
file_size = os.path.getsize(file_path)
print(f"File size: {file_size} bytes")

# Read file content (one JSON record per line)
with open(file_path, 'r', encoding='utf-8') as file:
    content = file.readlines()

# Check number of lines
print(f"Number of lines read: {len(content)}")

# Flatten each record into one row per (query, result) pair
data = []
for line in content:
    entry = json.loads(line)
    src_id = entry['src_id']
    src_query = entry['src_query']
    # Each item in 'tgt_results' is a [result_id, score] pair
    for result_id, score in entry['tgt_results']:
        data.append([src_id, src_query, result_id, score])

# Create a DataFrame
df = pd.DataFrame(data, columns=['src_id', 'src_query', 'result_id', 'score'])

# Check number of rows in DataFrame
print(f"Number of rows in DataFrame: {len(df)}")

# Print the first 1000 rows (pandas truncates the printed view)
print(df.head(1000))




File size: 16626510 bytes
Number of lines read: 10000
Number of rows in DataFrame: 1000000
    src_id  src_query result_id  score
0     8221      Death      1942      6
1     8221      Death   1817604      4
2     8221      Death   7609604      4
3     8221      Death   7267253      4
4     8221      Death   1706783      4
..     ...        ...       ...    ...
995  17675  Lithuania   4557524      0
996  17675  Lithuania   4728781      0
997  17675  Lithuania   4780367      0
998  17675  Lithuania   5559984      0
999  17675  Lithuania   4974662      0

[1000 rows x 4 columns]
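The flattening step above reads the whole file into memory with `readlines()`. As an alternative sketch, the same rows can be produced lazily with a generator that also skips blank lines. The function name `iter_pairs` is hypothetical, and the sample records are assumptions modeled on the printed output: the record layout matches what the cell above parses, but whether the real file stores `result_id` as a string or a number is not visible here.

```python
import io
import json

def iter_pairs(lines):
    """Yield one (src_id, src_query, result_id, score) tuple per result."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        entry = json.loads(line)
        for result_id, score in entry['tgt_results']:
            yield entry['src_id'], entry['src_query'], result_id, score

# Assumed sample records standing in for the real .jl file
sample_file = io.StringIO(
    '{"src_id": "8221", "src_query": "Death", '
    '"tgt_results": [["1942", 6], ["1817604", 4]]}\n'
    '\n'
    '{"src_id": "17675", "src_query": "Lithuania", '
    '"tgt_results": [["4557524", 0]]}\n'
)

rows = list(iter_pairs(sample_file))
print(rows)
```

With the real data, passing an open file handle instead of `sample_file` and feeding the generator to `pd.DataFrame(..., columns=[...])` should give the same four-column table without holding the raw lines in a separate list.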


In [2]:
import pandas as pd
# Path to the text file
text_file_path = '/workspaces/miami-ds-5-final-project/data/raw/es.tsv'

# Load the text file into a DataFrame
text_df = pd.read_csv(text_file_path, sep='\t', header=None, names=['result_id', 'doc_text'])

# Print the first 100 rows of the text DataFrame (pandas truncates the printed view)
print(text_df.head(100))


    result_id                                           doc_text
0      842510         [{{fullurl:{{{2}}}|action=edit}} editar]  
1     7503491                                               Ó⇔¿?
2      855896         [{{fullurl:{{{2}}}|action=edit}} editar]  
3     5444240  La palabra mácula (del latín macŭla, «mancha»)...
4     4292429         [{{fullurl:{{{2}}}|action=edit}} editar]  
..        ...                                                ...
95    1467940                                            Imagen3
96    5338428  Puedes colaborar con artículos en el Wikiproye...
97    1437817  Hitlers Zweites Buch (Segundo libro de Hitler)...
98    5619132          Archivo:Dakar2006 Ullevalseter Esteve.jpg
99     494840  Wikipedia:Bienvenido Introducción a Wikipedia ...

[100 rows x 2 columns]


In [7]:
import pandas as pd

# Placeholder strings seen in the preview that carry no document text
unwanted_texts = ["[{{fullurl:{{{2}}}|action=edit}} editar]", "Ó⇔¿?"]

# Strip surrounding whitespace before comparing so padded copies are caught too
is_unwanted = text_df['doc_text'].str.strip().isin(unwanted_texts)
df_filtered1 = text_df[~is_unwanted]

# Display the filtered DataFrame
print("\nFiltered DataFrame:")
print(df_filtered1)


Filtered DataFrame:
         result_id                                           doc_text
3          5444240  La palabra mácula (del latín macŭla, «mancha»)...
6          6239109  Bailando por un sueño 2014 es la tercera tempo...
8          2593040                                Wikiproyecto África
10         7188837  El área de Arequipa Metropolitana, es un área ...
11         4639639  12012 (イチニーゼロイチニ, ichi-ni-zero-ichi-ni?) es un...
...            ...                                                ...
1578432    8832262  Victoria Lynn Rowell (Portland, Maine, 10 de m...
1578433    8831954  Cara Seymour (Essex, 6 de enero de 1964) es un...
1578434    8832982  Claudia Fernández Valdivia (La Paz, 30 de ener...
1578435     489700  El Achibueno es un río, tributario del Río Lon...
1578436    7823034  Valentín de Verástegui Barona (Vitoria, 1789 -...

[1578396 rows x 2 columns]
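The two strings removed above were spotted by eye in the preview. One possible way to look for other repeated placeholders is to count how often each stripped `doc_text` value occurs and review the most frequent ones. The sketch below runs on a toy frame whose values mimic rows from the preview; the helper name `repeated_texts`, the `min_count` threshold, and the trailing spaces in the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for text_df; values mimic rows shown in the preview output
toy_text_df = pd.DataFrame({
    'result_id': [842510, 7503491, 855896, 5444240, 4292429],
    'doc_text': [
        '[{{fullurl:{{{2}}}|action=edit}} editar]  ',  # assumed trailing spaces
        'Ó⇔¿?',
        '[{{fullurl:{{{2}}}|action=edit}} editar]',
        'La palabra mácula (del latín macŭla, «mancha»)...',
        '[{{fullurl:{{{2}}}|action=edit}} editar]  ',
    ],
})

def repeated_texts(frame, column='doc_text', min_count=2):
    """Return stripped text values that occur at least min_count times."""
    counts = frame[column].str.strip().value_counts()
    return counts[counts >= min_count]

candidates = repeated_texts(toy_text_df)
print(candidates)
```

Frequent values are only candidates for removal: a short string that repeats many times may still be legitimate content, so each one is worth checking before adding it to the filter list.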
