## Class 2  
Downloading and entering data  
Basic stats and plots  

In the last class you entered data yourself.  Most of the data analysis you will do in this class, and likely in the future, will use existing datasets from which you extract the data you want to examine.  In this class you will practise downloading datasets and opening them as pandas dataframes, ready for cleaning and subsetting (next week).  We start with a few examples, but feel free to find your own datasets to experiment with.

In [21]:
# Analysis modules
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

### Easy start. 
Pandas has a function for reading Excel sheets directly  

    df = pd.read_excel('my_data.xls')  
    
and for reading in comma-delimited files  

    df = pd.read_csv('my_data.csv')  
    
Read in count.xls and count.csv.

They are in the folder Datasets, which is in the folder above this one so the full path is:   
    ../Datasets/count.xls

count.txt is tab-delimited.  We need to tell pandas that the delimiter is a tab.  
The parameter for the delimiter is:

    sep =
   
Read in count.txt

You can also read in tab-delimited files using  

    pd.read_table()
    
Check this.
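
One way to check this without touching the course files is to parse the same text both ways. The sketch below uses a few made-up rows held in memory (the column names and values are invented, not taken from count.txt) and assumes only that the file is tab-delimited.

```python
import io
import pandas as pd

# Made-up tab-delimited text standing in for a file such as count.txt
tsv_text = "Field\tCount\nF1\t12\nF2\t7\nF3\t30\n"

# read_csv with an explicit tab separator
df_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# read_table uses a tab as its default separator
df_table = pd.read_table(io.StringIO(tsv_text))

# equals() compares values, column names and dtypes
print(df_csv.equals(df_table))
```

With a real file you would pass the path (for example '../Datasets/count.txt') in place of the io.StringIO object.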

Once it's in you can check the file structure is as you expect using  

    df.info()  

which gives you the full story, or 
    
    df.dtypes  
    
which tells you how pandas has coded each column of data

Compare both outputs.
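
As a rough illustration of the difference, here is a small sketch on invented data with a couple of gaps in it. Note that dtypes is an attribute (no brackets) while info() is a method.

```python
import io
import pandas as pd

# Made-up data: one missing count and one missing note
text = "Field\tCount\tNotes\n1\t12\tok\n2\t\tflooded\n3\t30\t\n"
df_demo = pd.read_csv(io.StringIO(text), sep="\t")

# dtypes: one data type per column, nothing else
print(df_demo.dtypes)

# info(): the same types plus row count, non-null counts and memory use
df_demo.info()
```

The missing count forces the Count column to a float type, which is the kind of surprise these two checks are meant to catch.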

There are 3 duff datasets in ../Datasets/

    count_duff1.txt
    count_duff2.txt
    count_duff3.txt
 

Read each into a dataframe with header, counts as integers and the field number as a string.



Useful parameters:

    field separator: sep =
    specify the null values: na_values =
    set column headers: header=None (or specify a list)
    Make data a specified type: dtype={'Column_A': 'string'}
    ignore problematic rows: on_bad_lines='skip' (error_bad_lines=False in pandas older than 1.3)
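
A minimal sketch of how these parameters can be combined is shown below. The messy text is invented for illustration and is not the content of any of the duff files; the column names, the "missing" token and the use of the nullable 'Int64' type are all assumptions. It needs pandas 1.3 or newer for on_bad_lines.

```python
import io
import pandas as pd

# Made-up messy file: no header row, "missing" marks a gap,
# and the third line has one field too many
messy = "001\t12\n002\tmissing\n003\t30\textra\n004\t5\n"

df_demo = pd.read_csv(
    io.StringIO(messy),
    sep="\t",
    header=None,                # the text has no header row
    names=["Field", "Count"],   # so supply the column names ourselves
    na_values=["missing"],      # treat this token as a missing value
    dtype={"Field": "string", "Count": "Int64"},  # nullable integer copes with gaps
    on_bad_lines="skip",        # drop rows with too many fields
)

print(df_demo)
print(df_demo.dtypes)
```

Reading the field number as a string keeps leading zeros such as "001", which would be lost if pandas guessed it was an integer.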

### The parts of a dataframe  
We've defined column headers, but there are also row labels in the left-most column.  This is the dataframe index.  Read in the dataframe count_things.txt with column names set with this list. 

    col_names = ["Meadow", "Pigs","Cows", "Potatoes","Turnips"]

We can call the columns as a list and the index as a list

In [18]:
df.columns

Index(['Meadow', 'Pigs', 'Cows', 'Potatoes', 'Turnips'], dtype='object')

What does the index look like?

The column and row headers are python sequences and you can pull out individual labels the same way you would access part of a python list.  Check the fourth item in columns and the fourth item in the index:

    df.columns[3]
    
    df.index[3]

You can replace the index with more informative values, the field names  

    df.set_index('Meadow', inplace=True)

What happens if you don't set inplace=True?

We can pick out individual values by specifying column and index

How many Cows in Meadow Hen_cae?  

    df.at['Hen_cae','Cows']

Change this, adding a cow to Hen_cae.  Use

    df.at[index, column] = 
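
The sketch below walks through these steps on a tiny invented dataframe. Only the column names and the meadow name Hen_cae come from the exercise; the second meadow and all the counts are made up.

```python
import pandas as pd

# Made-up counts for two meadows
df_demo = pd.DataFrame({
    "Meadow": ["Hen_cae", "Cae_mawr"],
    "Pigs": [3, 0],
    "Cows": [10, 4],
})

# Without inplace=True, set_index returns a new dataframe
# and leaves df_demo unchanged
indexed = df_demo.set_index("Meadow")
print(df_demo.index.tolist())
print(indexed.index.tolist())

# Read one cell by row label and column label, then add a cow
before = indexed.at["Hen_cae", "Cows"]
indexed.at["Hen_cae", "Cows"] = before + 1
print(before, indexed.at["Hen_cae", "Cows"])
```

Assigning the result to a new name, as here, is an alternative to inplace=True that keeps the original dataframe available.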

### Trickier. 
Downloading data from a website and opening it.  
NCBI has a list of sequenced genomes and their assembly metrics:
    
    INTRODUCTION
------------- 
species_genome_size.txt.gz provides the expected genome size for each species 
taxid with at least four assemblies in GenBank. The expected genome size range 
is used to identify outliers for a species that can result from errors. More
information about how the genome size ranges are calculated can be found
https://www.ncbi.nlm.nih.gov/assembly/help/genome-size-check/. 


The species_genome_size.txt.gz file has 5 tab-delimited columns. 
Header rows begin with '#'.

Column  1: species_taxid
   Taxonomic identifier of each species 
  
Column  2: min_ungapped_length
   Minimum expected ungapped genome size of an assembly for the species 
   
Column  3: max_ungapped_length
   Maximum expected ungapped genome size of an assembly for the species 
   
Column  4: expected_ungapped_length
   Median genome assembly size of assemblies for the species 

Column  5: number_of_genomes
   Number of genomes used to calculate the expected size range
   

We will download the file (using wget), unzip it (using gunzip), check its structure (using head) and read it in.

Using "!" at the start of a line tells the notebook that what follows is a shell (bash) command and not python

In [7]:
! wget https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/species_genome_size.txt.gz

--2023-08-04 11:11:16--  https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/species_genome_size.txt.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.228, 130.14.250.13
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 59264 (58K) [application/x-gzip]
Saving to: ‘species_genome_size.txt.gz’


2023-08-04 11:11:17 (267 KB/s) - ‘species_genome_size.txt.gz’ saved [59264/59264]



Unzip the datafile

In [8]:
! gunzip species_genome_size.txt.gz
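
As an aside, pandas can usually read a gzipped file without unzipping it first, because it infers the compression from the .gz extension. The sketch below demonstrates this on a tiny made-up file written to a temporary folder; the file name and columns are invented and are not the NCBI data.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny made-up gzipped, tab-delimited file to a temporary folder
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "demo.txt.gz")
with gzip.open(gz_path, "wt") as handle:
    handle.write("taxid\tn_genomes\n1\t4\n2\t9\n")

# Compression is inferred from the .gz extension by default
df_demo = pd.read_csv(gz_path, sep="\t")
print(df_demo)
```

Unzipping first, as this class does, has the advantage that you can inspect the file with head before deciding how to read it.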

In [None]:
Use "! ls" to check what files you have now

Check the data file's format using head as a bash command  

    ! head -3 my_file

Read the data in. Which pd.read format would be best?

Check that pandas has interpreted the type of data correctly using

    df.dtypes
    df.info()

#### What is the species with the most sequenced genomes?



df.max()['column'] will give you the maximum value for a column, but you also want to know the species id for this value.  

Try sorting the whole dataframe using 

    df.sort_values(by=['Column_name'])  
    
You can use 

    .tail(N)
    
to show just N rows

You can use 

    ascending=False
  
to show the highest values at the top
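
Put together, these pieces might look like the sketch below. The rows are invented; only the two column names are taken from the NCBI file description. It also shows idxmax as one possible alternative to sorting.

```python
import pandas as pd

# Made-up rows using two of the column names from the NCBI file
df_demo = pd.DataFrame({
    "species_taxid": [11, 22, 33, 44],
    "number_of_genomes": [5, 120, 17, 48],
})

# Largest counts first, then keep the top two rows
top = df_demo.sort_values(by=["number_of_genomes"], ascending=False).head(2)
print(top)

# Alternative: idxmax gives the row label of the maximum,
# and loc fetches that whole row
best = df_demo.loc[df_demo["number_of_genomes"].idxmax()]
print(best["species_taxid"])
```

Sorting ascending and using .tail(N) shows the same rows as sorting descending and using .head(N), just in the opposite order.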

Make a quick boxplot of the number of genomes using:

    sns.boxplot(x=df["number_of_genomes"])

You can check the identity of any genome at NCBI:  

    https://www.ncbi.nlm.nih.gov/assembly/    

search by txid[species_taxid]   
txid[3707] is mustard, Brassica juncea  

Not all data is as tidy as the NCBI download.  Sometimes you need to exclude lines when you read in data.

### Tricky file formats

Open the Excel sheet of data on agricultural productivity in the UK since 1973.  
This is from:
    
   https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1004686/AUK-Chapter5-13jul23.ods
        
This was originally an ods format file. It COULD be read in directly by installing the odf engine, but to save complications I've provided it as an Excel file in Datasets.  

    ../Datasets/AUK-Chapter5-13jul23.xlsx
    
Read it in and check format with  
    
    df.head(10)

At least it reads in, but it's clearly not right: the first few rows are not data.  We need to exclude them.  Here we do this by telling pandas which row holds the column headers, so that everything above it is dropped.

Specify which of the input rows should be the header with 
    
    header = N   
    
Remember that pandas is 0-indexed.

We could also skip rows using 

    skiprows=N
    
What should N be here?

The end of the file is untidy as well - have a look with   
    
    df.tail(10)

Use  

    skipfooter=
    
To tidy this up
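
To see how header, skiprows and skipfooter interact, here is a sketch on a small made-up table with junk lines at the top and bottom. It uses read_csv on in-memory text so that it runs anywhere; read_excel accepts the same three parameters. The layout is invented and does not match the real spreadsheet, so the numbers you need there will differ.

```python
import io
import pandas as pd

# Made-up sheet: two title lines, a header, two data rows, one footnote
raw = (
    "Table 5.1 Output\n"
    "Source: made up\n"
    "Item,1973,1974\n"
    "Wheat,10,12\n"
    "Barley,8,9\n"
    "Note: provisional\n"
)

# header=2 uses the third line (0-indexed) as column names
# and discards the lines above it
by_header = pd.read_csv(io.StringIO(raw), header=2, skipfooter=1, engine="python")

# skiprows=2 throws away two lines first, so the header is
# then the first remaining line
by_skip = pd.read_csv(io.StringIO(raw), skiprows=2, skipfooter=1, engine="python")

print(by_header)
print(by_header.equals(by_skip))
```

With read_csv, skipfooter needs engine="python"; that extra argument is not needed for read_excel.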

Check the dataframe is as you expect using df.info()

Empty rows are automatically skipped (compare the Excel file with what is read in), but rows which are mostly empty are filled with NaN.  We can remove these.  
(inplace=True) makes the change happen on the original dataframe  

    df.dropna(inplace=True)

or drop just the columns where there are missing values  

    df.dropna(axis='columns',inplace=True)  
    
or the rows with fewer than 2 non-missing values (thresh is the minimum number of non-missing values a row needs in order to be kept)  

    df.dropna(thresh=2,inplace=True)  
    
or rows with missing values in specific columns  

    df.dropna(subset=[1983, 1997],inplace=True)
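
The four variants are easier to compare side by side on a small example. The dataframe below is invented; it leaves inplace at its default so each call returns a new dataframe and the original stays intact for the next comparison.

```python
import numpy as np
import pandas as pd

# Made-up table with gaps in different places
df_demo = pd.DataFrame({
    "Item": ["Wheat", "Cereals", "Barley", "Oats"],
    1983: [10.0, np.nan, 8.0, 3.0],
    1997: [12.0, np.nan, np.nan, 4.0],
})

complete_rows = df_demo.dropna()                # rows with no gaps at all
complete_cols = df_demo.dropna(axis="columns")  # columns with no gaps at all
at_least_two = df_demo.dropna(thresh=2)         # rows with 2 or more non-missing values
has_1983 = df_demo.dropna(subset=[1983])        # rows where 1983 is present

print(len(complete_rows), list(complete_cols.columns),
      len(at_least_two), len(has_1983))
```

Note that thresh counts the values that are present, not the ones that are missing, so the "Item" label itself counts towards the threshold.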

To examine this data it makes more sense to have outputs as columns.  We can transpose the dataframe using 

    df.T  
    
use 

    df.head(N) 
    
to check this has worked.

Now we have the years as the row names, and the types of output as row one.  
How to fix this and make the types of output column headers?  



    df.iloc[N]  

returns the values in row N (as a pandas Series).  We can use this to specify new column names.

    df.columns = df.iloc[0]

Check with df.head

We can now drop the redundant row 1.  There are lots of ways to do this!  

Drop by index name of row  

    df.drop(['Unnamed: 0'])  
    
Drop by index range  

    df[1:]  
    
Drop by index location

    df.drop(df.index[0])
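
The transpose, re-header and drop steps can be chained as in the sketch below. The data are invented and much smaller than the real table; the 'Unnamed: 0' label is assumed to be what pandas gave the first column on reading.

```python
import pandas as pd

# Made-up stand-in: items down the rows, years across the columns
wide = pd.DataFrame({
    "Unnamed: 0": ["Wheat", "Barley"],
    1973: [10, 8],
    1974: [12, 9],
})

flipped = wide.T                           # years are now the row labels
flipped.columns = flipped.iloc[0]          # first row (item names) becomes the header
flipped = flipped.drop(flipped.index[0])   # drop the row that held the names

print(flipped)
```

Both .T and .drop() return new dataframes, so the result has to be assigned to a name to keep it.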

It will be useful to have the years as a column, not an index

Make the index into a new column using  

    df.reset_index()  
    
Check it's worked using

    df.head(3)

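We also need to rename the new column. Specify which column to work on by putting its position in the square brackets, and give the new name.

    df.columns.values[ ] = "New_Name"

Assigning into df.columns.values edits the column labels in place, which can behave unpredictably; df.rename(columns={"old_name": "New_Name"}) is a safer alternative.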
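
A minimal sketch of moving the index into a column and renaming it, using df.rename rather than editing the column labels directly. The data are made up, and "Year" is simply the name this example chooses.

```python
import pandas as pd

# Made-up dataframe with the years as the index
df_demo = pd.DataFrame(
    {"Wheat": [10, 12], "Barley": [8, 9]},
    index=[1973, 1974],
)

# reset_index returns a new dataframe; an unnamed index
# becomes a column called "index"
df_demo = df_demo.reset_index()
print(list(df_demo.columns))

# rename maps old names to new ones and also returns a new dataframe
df_demo = df_demo.rename(columns={"index": "Year"})
print(list(df_demo.columns))
```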

Now we can check what we have.  Use

    df.info()
    
to see what the columns are and which are numeric

All our data columns are coded as objects now (due to the text in the column names when they were row 1).  We can fix this by applying pd.to_numeric across the dataframe  
    
    df = df.apply(pd.to_numeric)  
    
Check this has worked with   

    df.info()
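
If a column still contains stray text, pd.to_numeric raises an error by default. The sketch below, on invented values, shows the optional errors="coerce" argument, which converts anything unparseable to NaN instead. Whether silently coercing is appropriate depends on your data.

```python
import pandas as pd

# Object columns holding number-like strings, plus one stray "n/a"
df_demo = pd.DataFrame(
    {"Year": ["1973", "1974"], "Wheat": ["10.5", "n/a"]},
    dtype="object",
)

# Extra keyword arguments to apply are passed on to pd.to_numeric
numeric = df_demo.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric)
```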

Now we can plot changes in agricultural output since 1973!

Use sns.lineplot to look at changes in any agricultural output.  

    sns.lineplot(x=df['Year'], y=df['plants and flowers'])
    
Use list(df.columns) to see your options for plotting

Can you see the effect of the foot and mouth outbreak?  The green revolution in grain productivity?  The switch to imported fruit and vegetables?

### Saving dataframes

Save as csv

In [227]:
df.to_csv('UK_agriculture.csv', index=False)

Save as excel

In [228]:
df.to_excel('UK_agriculture.xlsx', index=False)

What happens if you leave off 'index=False'?
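
One way to explore this without creating files: when to_csv is called with no file name it returns the text it would have written. The sketch uses a tiny made-up dataframe.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"Year": [1973, 1974], "Wheat": [10.0, 12.0]})

# With index=False only the data columns are written
no_index = df_demo.to_csv(index=False)

# By default the row index is written as an extra, unnamed first column
with_index = df_demo.to_csv()

print(no_index)
print(with_index)

# Reading the second version back shows how pandas labels that column
print(list(pd.read_csv(io.StringIO(with_index)).columns))
```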

#### Pickling  
Sometimes you are working solely within python and want to preserve the dataframe exactly as it is (data types, index and all) rather than writing to a flat csv or Excel file




In [59]:
df.to_pickle('UK_crops.pkl')

Reading from pickle

In [60]:
df = pd.read_pickle('UK_crops.pkl')
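
To illustrate what pickling preserves, the sketch below sends a small invented dataframe through both a pickle and a csv round trip and compares the resulting data types. The file is written to a temporary folder and the column names are made up.

```python
import io
import os
import tempfile

import pandas as pd

# Made-up dataframe with a string column and a datetime column
df_demo = pd.DataFrame({
    "Field": pd.Series(["001", "002"], dtype="string"),
    "Sampled": pd.to_datetime(["2023-08-04", "2023-08-05"]),
})

# Pickle round trip: dtypes come back exactly as they were
pkl_path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
df_demo.to_pickle(pkl_path)
from_pickle = pd.read_pickle(pkl_path)

# csv round trip: pandas has to guess the types again on reading
from_csv = pd.read_csv(io.StringIO(df_demo.to_csv(index=False)))

print(from_pickle.dtypes)
print(from_csv.dtypes)
```

Pickle files can run arbitrary code when loaded, so only read pickles that you created yourself or that come from a source you trust.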

Checking contents

In [61]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 49 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Production(2015=100)  65 non-null     object 
 1   1973                  64 non-null     float64
 2   1974                  64 non-null     float64
 3   1975                  64 non-null     float64
 4   1976                  64 non-null     float64
 5   1977                  64 non-null     float64
 6   1978                  64 non-null     float64
 7   1979                  64 non-null     float64
 8   1980                  64 non-null     float64
 9   1981                  64 non-null     float64
 10  1982                  64 non-null     float64
 11  1983                  64 non-null     float64
 12  1984                  64 non-null     float64
 13  1985                  64 non-null     float64
 14  1986                  64 non-null     float64
 15  1987                  64 