# Data Exploration and Cleaning

This notebook explores the dataset and cleans it for further analysis. The README file of the repository describes the data source.

The CSV is stored in the repository as an XZ-compressed file, which pandas can read directly without decompressing it first.
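As a minimal sketch of how this works, the round trip below writes a small table to an `.xz` file and reads it back. The table `toy` and the file name `toy.csv.xz` are hypothetical and only serve as illustration. pandas uses `compression="infer"` by default for both `to_csv` and `read_csv`, so the `.xz` suffix should be enough to select the codec.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy table; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame(
    {
        "element_symbol": ["blaNDM-1", "ble"],
        "class": ["BETA-LACTAM", "BLEOMYCIN"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.csv.xz"
    # The ".xz" suffix makes pandas compress the file on write ...
    toy.to_csv(path, index=False)
    # ... which we can confirm from the XZ magic bytes at the start of the file.
    is_xz = path.read_bytes()[:6] == b"\xfd7zXZ\x00"
    # On read, the compression is again inferred from the suffix.
    roundtrip = pd.read_csv(path)

print(is_xz)
print(roundtrip)
```

Passing `compression='xz'` explicitly, as done in this notebook, is equivalent and makes the intent visible to the reader.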

In [5]:
import pandas as pd

First, we load the data and explore its structure.

In [10]:
df = pd.read_csv("../data/bq-results-20240802-233415-1722641679845.csv.xz", 
                 compression='xz')
df.shape

(138864, 18)

In [12]:
df.head()

Unnamed: 0,scientific_name,contig_acc,biosample_acc,bioproject_acc,target_acc,element_symbol,protein_acc,type,class,subclass,taxgroup_name,strain,serovar,isolation_source,geo_loc_name,epi_type,host,collection_date
0,Salmonella enterica,AAFUZC010000051.1,SAMN03098832,,PDT000041084.2,blaDHA,,AMR,BETA-LACTAM,CEPHALOSPORIN,Salmonella enterica,AM49198,,,USA,clinical,,
1,Salmonella enterica,AAFUZC010000051.1,SAMN03098832,,PDT000041084.2,ble,EBK1426116.1,AMR,BLEOMYCIN,BLEOMYCIN,Salmonella enterica,AM49198,,,USA,clinical,,
2,Salmonella enterica,AAFUZC010000051.1,SAMN03098832,,PDT000041084.2,blaNDM-1,EBK1426117.1,AMR,BETA-LACTAM,CARBAPENEM,Salmonella enterica,AM49198,,,USA,clinical,,
3,Salmonella enterica,AAFUZC010000073.1,SAMN03098832,,PDT000041084.2,blaNDM-1,EBK1426163.1,AMR,BETA-LACTAM,CARBAPENEM,Salmonella enterica,AM49198,,,USA,clinical,,
4,Salmonella enterica,AAFUZC010000073.1,SAMN03098832,,PDT000041084.2,ble,EBK1426164.1,AMR,BLEOMYCIN,BLEOMYCIN,Salmonella enterica,AM49198,,,USA,clinical,,


In [13]:
df.columns

Index(['scientific_name', 'contig_acc', 'biosample_acc', 'bioproject_acc',
       'target_acc', 'element_symbol', 'protein_acc', 'type', 'class',
       'subclass', 'taxgroup_name', 'strain', 'serovar', 'isolation_source',
       'geo_loc_name', 'epi_type', 'host', 'collection_date'],
      dtype='object')

The dataset has 138,864 rows and 18 columns. The columns are described below:

- scientific_name: the bacterial species name
- contig_acc: the NCBI accession number of the contig where the gene is present
- biosample_acc: the biosample accession number
- bioproject_acc: the bioproject accession number
- target_acc: unsure, this seems to be an accession number for the genome assembly version
- element_symbol: the gene name (e.g. _blaNDM-1_)
- protein_acc: the protein accession number for the gene
- type: the type of mechanism (e.g. AMR, antimicrobial resistance)
- class: the functional class for the mechanism in this row
- subclass: the functional subclass for the mechanism in this row
- taxgroup_name: The taxonomic group. It may be similar to the scientific name, but unsure at the moment
- strain: the strain for this genome
- serovar: For some taxa, it could be a serovar classification.
- isolation_source: The biological source of the sample (it seems). For example, blood.
- geo_loc_name: the geographical origin of the isolate
- epi_type: needs some exploration; it seems to be the type of strain (clinical, maybe environmental?)
- host: The organism from which the bacterium was isolated, for example _Homo sapiens_ (human)
- collection_date: The date when the isolate was collected.
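Several of the descriptions above are tentative (for instance `taxgroup_name` and `epi_type`). One possible way to check them is sketched below. The helper names `summarize_categories` and `fraction_matching` are hypothetical, and the toy rows are made up for illustration, not taken from the dataset.

```python
import pandas as pd


def summarize_categories(df: pd.DataFrame, column: str, top: int = 10) -> pd.Series:
    """Count the most frequent values of a column, keeping missing values visible."""
    return df[column].value_counts(dropna=False).head(top)


def fraction_matching(df: pd.DataFrame, left: str, right: str) -> float:
    """Fraction of rows where two columns hold exactly the same value."""
    return float((df[left] == df[right]).mean())


# Hypothetical toy rows that mimic the layout of the real table.
toy = pd.DataFrame(
    {
        "scientific_name": ["Salmonella enterica", "Salmonella enterica", "Species X"],
        "taxgroup_name": ["Salmonella enterica", "Salmonella enterica", "Group containing X"],
        "epi_type": ["clinical", "clinical", None],
    }
)

epi_counts = summarize_categories(toy, "epi_type")
same_name = fraction_matching(toy, "scientific_name", "taxgroup_name")
print(epi_counts)
print(same_name)
```

On the real data, the same calls would be `summarize_categories(df, "epi_type")` and `fraction_matching(df, "scientific_name", "taxgroup_name")`. A matching fraction well below 1 would suggest that `taxgroup_name` is a broader grouping than the species name.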

Just from the first five rows, we can see that several columns contain missing values, so some additional filtering is needed before the data is ready for analysis.
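A quick way to quantify how incomplete each column is could look like the sketch below. The helper `missing_summary` is a hypothetical name, and the toy frame only imitates the pattern seen in the first rows (an empty `bioproject_acc`, a partially filled `protein_acc`).

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column count and percentage of missing values, most incomplete first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame(
        {
            "n_missing": n_missing,
            "pct_missing": (100 * n_missing / len(df)).round(1),
        }
    )
    return summary.sort_values("n_missing", ascending=False)


# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame(
    {
        "scientific_name": ["Salmonella enterica"] * 3,
        "bioproject_acc": [None, None, None],
        "protein_acc": [None, "EBK1426116.1", "EBK1426117.1"],
    }
)

summary = missing_summary(toy)
print(summary)
```

Running `missing_summary(df)` on the full table would show which columns are too sparse to be useful and which ones only need a few rows dropped.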