# Try Out pandas: Examine A GTF File

### Let's Get Situated

Read the GTF file into a pandas DataFrame. We've cut down this GTF file so that it only contains features found on chromosome 12.

In [2]:
import pandas as pd

In [3]:
# Instead of creating a new DataFrame from scratch,
# we're going to read in the actual GTF file
chr12_gtf = pd.read_csv('genes_ucsc.chr12.mod.gtf')

In [4]:
# Oh man, that first attempt looked bad (why?). Let's try again
chr12_gtf = pd.read_csv(
    'genes_ucsc.chr12.mod.gtf',
    sep="\t",
    names=['seqname', 'source', 'feature', 'start', 'end',
           'score', 'strand', 'frame', 'attributes'],
)
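
One way to see why the first attempt goes wrong is to run both calls on a tiny in-memory sample. The sketch below assumes a three-line, tab-separated sample whose attribute strings are shortened stand-ins (the real file's attributes are longer), and `io.StringIO` stands in for the file on disk. By default `read_csv` splits on commas and treats the first row as the header, whereas a GTF file is tab-separated and has no header row.

```python
import io
import pandas as pd

# Hypothetical three-line GTF sample (tab-separated, no header row).
# The attribute strings are shortened stand-ins, not the real file contents.
sample = (
    'chr12\tunknown\texon\t43757\t43793\t.\t+\t0\tgene_id "ATXN2"; gene_name "ATXN2";\n'
    'chr12\tunknown\tCDS\t43757\t43793\t.\t+\t0\tgene_id "ATXN2"; gene_name "ATXN2";\n'
    'chr12\tunknown\texon\t45459\t45518\t.\t+\t0\tgene_id "ATXN2"; gene_name "ATXN2";\n'
)

gtf_columns = ['seqname', 'source', 'feature', 'start', 'end',
               'score', 'strand', 'frame', 'attributes']

# Default call: comma separator, first row consumed as the header.
first_try = pd.read_csv(io.StringIO(sample))

# Tab separator plus explicit column names (passing names means no row is used as a header).
second_try = pd.read_csv(io.StringIO(sample), sep='\t', names=gtf_columns)

print(first_try.shape)   # every line collapses into a single column
print(second_try.shape)  # one row per line, nine named columns
```

Comparing the two shapes shows that the default call loses a data row to the header and never splits the fields.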

### Exercise 1: Get a feel for the DataFrame

- Look at the top 4 lines of the GTF file.
- Get the dimensions (how many rows and columns) of the GTF file.
- Find the `dtypes` of the columns.

In [5]:
# print the top 4 lines of the GTF file
print(chr12_gtf.head(4))

# print the dimensions (rows and columns) of the file
print(chr12_gtf.shape)

# print the datatypes (dtypes) of the columns
print(chr12_gtf.dtypes)

         seqname   source feature  start    end score strand frame  \
0  chr12_partial  unknown    exon      1    892     .      +     0   
1  chr12_partial  unknown     CDS    162    892     .      +     0   
2          chr12  unknown    exon  43757  43793     .      +     0   
3          chr12  unknown     CDS  43757  43793     .      +     0   

                                          attributes  
0  gene_id "ATXN2_partial"; gene_name "ATXN2_part...  
1  gene_id "ATXN2_partial"; gene_name "ATXN2_part...  
2  gene_id "ATXN2"; gene_name "ATXN2"; p_id "P137...  
3  gene_id "ATXN2"; gene_name "ATXN2"; p_id "P137...  
(47113, 9)
seqname       object
source        object
feature       object
start          int64
end            int64
score         object
strand        object
frame         object
attributes    object
dtype: object
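
In the rows printed above, `score` holds the `.` placeholder, which pandas reads as text. That is why the column is not numeric. As a hedged sketch on a made-up three-row frame (not the real data), `pd.to_numeric` with `errors='coerce'` is one way to turn such a column into numbers, with the placeholders becoming `NaN`:

```python
import pandas as pd

# Made-up rows for illustration; only the '.' placeholder mirrors the output above.
toy = pd.DataFrame({
    'start': [1, 162, 43757],
    'score': ['.', '.', '0.5'],
})

print(toy.dtypes)  # start is an integer column, score holds strings

# Convert to numbers; anything unparseable (the '.') becomes NaN.
toy['score_num'] = pd.to_numeric(toy['score'], errors='coerce')
print(toy.dtypes)
print(toy)
```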


### Exercise 2: Extract some biological information from the DataFrame

- Look at the top 10 entries in the feature column of the GTF file.
- How many entries are there for each kind of "feature" in this GTF file?

In [6]:
# print the top 10 entries of the feature column
print(chr12_gtf["feature"].head(10))

# print the number of entries for each kind of feature
print(chr12_gtf["feature"].value_counts())

0    exon
1     CDS
2    exon
3     CDS
4    exon
5     CDS
6    exon
7     CDS
8    exon
9     CDS
Name: feature, dtype: object
exon           23203
CDS            20284
start_codon     1818
stop_codon      1808
Name: feature, dtype: int64
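
As a quick sanity check, the four counts above sum to 47113, the row count reported by `shape`. If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on an invented feature column (the values are not taken from the file):

```python
import pandas as pd

# Toy feature column; the values are invented for illustration.
features = pd.Series(['exon', 'CDS', 'exon', 'start_codon', 'exon', 'CDS'],
                     name='feature')

counts = features.value_counts()                    # absolute counts, most frequent first
fractions = features.value_counts(normalize=True)   # same ordering, as proportions

print(counts)
print(fractions)
```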


### Exercise 3: Let's think about DataFrames as a data structure

- What are some types of data you would store in a DataFrame?
- What are some types of data you wouldn't store in a DataFrame?

###### What are some types of data you would store in a DataFrame?
The data I would store in a DataFrame would include an indexing column that distinguishes each record from the rest (for example, a name).

###### What are some types of data you wouldn't store in a DataFrame?
I wouldn't store extraneous data that has no relationship to the rest of the dataset.

### Exercise 4: Perform operations on the DataFrame

- Find the length of each feature in the GTF file by subtracting the start coordinate from the end coordinate.
- Time this operation

In [7]:
import time
start = time.time()

# find the length of each feature in the GTF file by subtracting the start coordinate from the end coordinate
print(chr12_gtf["end"] - chr12_gtf["start"])

end = time.time()
print("Total time was: " + str(end - start))

0         891
1         730
2          36
3          36
4          59
5          59
6          71
7          71
8         150
9         150
10        124
11        124
12         91
13         91
14        197
15        197
16        178
17        178
18        209
19        209
20        182
21        182
22        197
23        197
24        107
25        107
26         70
27         70
28        304
29        304
         ... 
47083    4967
47084    4967
47085       2
47086       2
47087       2
47088       2
47089       2
47090       2
47091       2
47092       2
47093       2
47094     659
47095       2
47096      83
47097      84
47098      84
47099     128
47100     128
47101     120
47102     120
47103     216
47104     216
47105     123
47106     123
47107     127
47108     127
47109     248
47110     270
47111       2
47112      29
Length: 47113, dtype: int64
Total time was: 0.05898165702819824
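
GTF coordinates are 1-based and inclusive at both ends, so `end - start` is one less than the number of bases a feature covers. The sketch below uses a small invented frame rather than the real file. It computes inclusive lengths and compares the vectorized subtraction with an explicit Python loop, timing both with `time.perf_counter`. Timings vary by machine, so none are shown.

```python
import time
import pandas as pd

# Invented coordinates for illustration (the first two pairs echo rows 0 and 1 above).
toy = pd.DataFrame({'start': [1, 162, 43757] * 10000,
                    'end':   [892, 892, 43793] * 10000})

# Vectorized: one subtraction over whole columns, plus 1 for inclusive ends.
t0 = time.perf_counter()
lengths_vec = toy['end'] - toy['start'] + 1
t_vec = time.perf_counter() - t0

# Row-by-row loop, for comparison only.
t0 = time.perf_counter()
lengths_loop = [e - s + 1 for s, e in zip(toy['start'], toy['end'])]
t_loop = time.perf_counter() - t0

print(lengths_vec.head(3).tolist())
print(f"vectorized: {t_vec:.6f}s, loop: {t_loop:.6f}s")
```

Keeping the `print` of the result outside the timed region, as here, means the measurement covers only the arithmetic.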


### Exercise 5: Get information about exons from the DataFrame

- a. Get all the exons in the gtf file.
- b. For the exons, pull out the following columns:  seqname, start, end and attributes.
- c. Try adding the .values attribute on the output from (b). It looks different from the (a) output. Why?

In [8]:
# get all exons
exons = chr12_gtf["feature"] == "exon"
exons_gtf = chr12_gtf[exons]

# pull out the seqname, start, end, and attributes columns
specific_exon_columns = exons_gtf[["seqname", "start", "end", "attributes"]]

# output the specific exon columns as a NumPy array
specific_exon_columns.values

"""
The .values attribute returns the data as a NumPy array instead of a pandas DataFrame,
so the row index and column labels are dropped and the output no longer looks tabular.
"""

'\nThe .values attribute returns the data as a NumPy array instead of a pandas DataFrame,\nso the row index and column labels are dropped and the output no longer looks tabular.\n'
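
To make the difference concrete, here is a hedged sketch on an invented two-row frame. It uses `to_numpy()`, which the pandas documentation recommends over `.values`; both return a NumPy array here.

```python
import numpy as np
import pandas as pd

# Invented two-row frame with mixed column types (strings and integers).
toy = pd.DataFrame({'seqname': ['chr12', 'chr12'],
                    'start': [43757, 45459],
                    'end': [43793, 45518]},
                   index=[2, 4])

arr = toy.to_numpy()   # same data as toy.values

print(type(arr))
print(arr.dtype)       # mixed strings and ints fall back to a generic object array
print(arr.shape)
print(toy.columns.tolist(), toy.index.tolist())  # labels live on the DataFrame only
```

The array keeps the cell values and their row/column positions, but the column names and the index labels (2 and 4 here) are not part of it.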

### Exercise 6: Get information about the gene ATXN2

- Find all the rows where the `attributes` column contains `ATXN2` as a gene name
- Narrow that down to rows where the `feature` is also a `CDS`
- Sort these records by start coordinate

**Note:** This is a little more involved. You should use Google/DuckDuckGo/ESP to search for hints on how to do this. Searching this stuff online is how we learn to do these things.

In [9]:
# build a boolean mask: CDS rows whose attributes mention ATXN2 (case-insensitive substring match)
atxn2 = (chr12_gtf["attributes"].str.lower().str.contains("atxn2")) & (chr12_gtf["feature"] == "CDS")

# sort the atxn2 gene records by start coordinate
atxn2_gtf = chr12_gtf[atxn2]
atxn2_gtf = atxn2_gtf.sort_values(by=["start"])

# return the atxn2 gtf DataFrame
atxn2_gtf

| | seqname | source | feature | start | end | score | strand | frame | attributes |
|---|---|---|---|---|---|---|---|---|---|
| 1 | chr12_partial | unknown | CDS | 162 | 892 | . | + | 0 | gene_id "ATXN2_partial"; gene_name "ATXN2_part... |
| 3 | chr12 | unknown | CDS | 43757 | 43793 | . | + | 0 | gene_id "ATXN2"; gene_name "ATXN2"; p_id "P137... |
| 5 | chr12 | unknown | CDS | 45459 | 45518 | . | + | 0 | gene_id "ATXN2"; gene_name "ATXN2"; p_id "P137... |
| 7 | chr12 | unknown | CDS | 46699 | 46770 | . | + | 0 | gene_id "ATXN2"; gene_name "ATXN2"; p_id "P137... |
| 9 | chr12 | unknown | CDS | 47246 | 47396 | . | + | 0 | gene_id "ATXN2"; gene_name "ATXN2"; p_id "P137... |
| 11 | chr12 | unknown | CDS | 74360 | 74484 | . | + | 0 | gene_id "ATXN2"; gene_name "ATXN2"; p_id "P137... |
| 13 | chr12 | unknown | CDS | 78703 | 78794 | . | + | 0 | gene_id "ATXN2"; gene_name "ATXN2"; p_id "P137... |
| 15 | chr12 | unknown | CDS | 79600 | 79797 | . | + | 0 | gene_id "ATXN2"; gene_name "ATXN2"; p_id "P137... |
| 17 | chr12 | unknown | CDS | 81249 | 81427 | . | + | 0 | gene_id "ATXN2"; gene_name "ATXN2"; p_id "P137... |
| 19 | chr12 | unknown | CDS | 83313 | 83522 | . | + | 0 | gene_id "ATXN2"; gene_name "ATXN2"; p_id "P137... |
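
The substring match also keeps the `ATXN2_partial` record (the first row of the output). If only the gene named exactly `ATXN2` is wanted, one option is to pull the `gene_name` value out of `attributes` with a regular expression and compare it exactly. This is a sketch on invented rows that follow the `gene_id "..."; gene_name "...";` pattern visible above. It is not the notebook's original approach.

```python
import pandas as pd

# Invented rows; the attribute strings only imitate the pattern shown in the output.
toy = pd.DataFrame({
    'feature': ['CDS', 'CDS', 'exon', 'CDS'],
    'start': [45459, 162, 43757, 43757],
    'attributes': [
        'gene_id "ATXN2"; gene_name "ATXN2";',
        'gene_id "ATXN2_partial"; gene_name "ATXN2_partial";',
        'gene_id "ATXN2"; gene_name "ATXN2";',
        'gene_id "ATXN2"; gene_name "ATXN2";',
    ],
})

# Extract the quoted value that follows gene_name.
gene_name = toy['attributes'].str.extract(r'gene_name "([^"]+)"', expand=False)

# Exact match on the gene name, CDS features only, sorted by start coordinate.
exact = toy[(gene_name == 'ATXN2') & (toy['feature'] == 'CDS')].sort_values(by='start')
print(exact)
```

Because the comparison is on the extracted name rather than on a substring, `ATXN2_partial` no longer matches.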
