# Convert to count matrix
The command in this section builds a simplified count matrix from featureCounts output by:
1. Extracting only necessary columns (gene IDs and counts)
2. Cleaning up sample names
3. Reformatting to a tab-delimited file
4. Making it suitable for downstream analysis in R/DESeq2

Detailed explanation
1. `cat data/gene_counts.tsv`
- Reads the featureCounts output file
- featureCounts writes a tab-separated file with one row per gene and one count column per input BAM file

2. `awk '(NR>1) {printf "%s ", $1; for (i=7; i<=NF; i++) printf "%s ", $i; print ""}'`
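- `NR>1`: Skips the first line, which in featureCounts output is a `#` comment recording the program and command line; the column header (`Geneid`, ...) on line 2 is kept
- `printf "%s ", $1`: Prints the first column (gene IDs), followed by a space
- `for (i=7; i<=NF; i++)`: Loops from the 7th column to the last one
- These columns contain the actual count values
- The first 6 columns in featureCounts output contain gene annotation (Geneid, Chr, Start, End, Strand, Length)
- Each field is printed with a trailing space, so every output line ends with an extra space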

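3. `sed s/bam//g`
- Removes every occurrence of the string `bam` from the stream
- Intended to strip the `bam` extension from the sample names in the header, but it would also alter any gene ID containing `bam`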

4. `tr -d "/"`
- Deletes all forward slashes
- Removes the directory separators left over from the BAM file paths

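5. `tr -d "."`
- Deletes all dots, including those in the sample file names
- Also deletes the dot in versioned gene IDs, so an ID such as `ENSG00000223972.4` becomes `ENSG000002239724`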

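6. `tr " " "\t"`
- Converts spaces to tabs
- Makes the output tab-delimited; the trailing space on each line becomes a trailing tab, which later shows up as an empty last column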

7. `> data/gene_counts_raw.tsv`
- Saves the processed output to a new file
- Produces a count matrix with just gene IDs and counts

Example transformation:
```
Original featureCounts output:
# Program:featureCounts ...; Command:...
Geneid  Chr  Start  End  Strand  Length  ./sample1.bam  ./sample2.bam
YDL248W  chr4  1802  2953  +  1152  45  67

After processing:
Geneid  sample1  sample2
YDL248W  45  67
```
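
The same transformation can also be sketched in pandas, which leaves the dots inside gene IDs untouched. This is a minimal sketch rather than the notebook's method: the helper name `featurecounts_to_matrix` and the inline `EXAMPLE` text are hypothetical, and it assumes the standard featureCounts layout (one `#` comment line, then a header, with the annotation columns named as listed above).

```python
import io
import re

import pandas as pd

# Hypothetical miniature featureCounts output, used only for illustration.
EXAMPLE = (
    "# Program:featureCounts; Command:...\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\t./sample1.bam\t./sample2.bam\n"
    "YDL248W\tchr4\t1802\t2953\t+\t1152\t45\t67\n"
)

ANNOTATION_COLUMNS = ["Chr", "Start", "End", "Strand", "Length"]


def featurecounts_to_matrix(handle):
    """Return a frame with Geneid plus one count column per sample."""
    # comment="#" skips the leading program/command line.
    df = pd.read_csv(handle, sep="\t", comment="#")
    df = df.drop(columns=ANNOTATION_COLUMNS)
    # Reduce each sample path to its file name without directory or .bam suffix.
    return df.rename(
        columns=lambda name: re.sub(r"\.bam$", "", name.rsplit("/", 1)[-1])
    )


counts = featurecounts_to_matrix(io.StringIO(EXAMPLE))
print(counts)
```

Because only the header names are rewritten, versioned IDs and gene names containing `bam` pass through unchanged, and no trailing empty column is produced.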

In [1]:
%%time
!cat data/gene_counts.tsv | awk '(NR>1) {printf "%s ", $1; for (i=7; i<=NF; i++) printf "%s ", $i; print ""}' | sed s/bam//g | tr -d "/" | tr -d "." | tr " " "\t" > data/gene_counts_raw.tsv

CPU times: user 2.4 ms, sys: 2.06 ms, total: 4.47 ms
Wall time: 261 ms


In [2]:
import pandas as pd

df_gene_count = pd.read_csv("data/gene_counts_raw.tsv", sep="\t")
print(df_gene_count.shape)
df_gene_count.head()

(57820, 8)


Unnamed: 0,Geneid,_filesTien_Cont1_S25sorted,_filesTien_Cont2_S26sorted,_filesTien_Cont3_S27sorted,_filesTien_Zn1_S28sorted,_filesTien_Zn2_S29sorted,_filesTien_Zn3_S30sorted,Unnamed: 7
0,ENSG000002239724,0,4,1,0,5,0,
1,ENSG000002272324,34,531,758,632,558,496,
2,ENSG000002434852,0,1,2,0,0,1,
3,ENSG000002376132,0,0,0,0,0,0,
4,ENSG000002680202,0,0,0,0,0,0,


In [3]:
# drop the last column (empty, created by the trailing tab on each line)
df_gene_count = df_gene_count.drop(df_gene_count.columns[-1], axis=1)
df_gene_count.head()

Unnamed: 0,Geneid,_filesTien_Cont1_S25sorted,_filesTien_Cont2_S26sorted,_filesTien_Cont3_S27sorted,_filesTien_Zn1_S28sorted,_filesTien_Zn2_S29sorted,_filesTien_Zn3_S30sorted
0,ENSG000002239724,0,4,1,0,5,0
1,ENSG000002272324,34,531,758,632,558,496
2,ENSG000002434852,0,1,2,0,0,1
3,ENSG000002376132,0,0,0,0,0,0
4,ENSG000002680202,0,0,0,0,0,0
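
The empty `Unnamed: 7` column comes from the trailing tab at the end of each line. As a hedged alternative to dropping by position, columns that are entirely empty can be dropped by content, so a real sample column is never removed by mistake. The inline text below is made-up illustration data, not the notebook's file.

```python
import io

import pandas as pd

# Made-up two-line file where each line ends with a trailing tab.
text = "Geneid\tsampleA\tsampleB\t\ngene1\t3\t5\t\n"

df = pd.read_csv(io.StringIO(text), sep="\t")
print(df.columns.tolist())  # includes an extra "Unnamed: 3" column

# Drop only columns in which every value is missing.
df = df.dropna(axis=1, how="all")
print(df.columns.tolist())
```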


In [4]:
# remove the last character from Geneid: the version digit fused onto the ID when the dots were deleted (assumes one-digit versions)
df_gene_count["Geneid"] = df_gene_count["Geneid"].str[:-1]
df_gene_count.head()

Unnamed: 0,Geneid,_filesTien_Cont1_S25sorted,_filesTien_Cont2_S26sorted,_filesTien_Cont3_S27sorted,_filesTien_Zn1_S28sorted,_filesTien_Zn2_S29sorted,_filesTien_Zn3_S30sorted
0,ENSG00000223972,0,4,1,0,5,0
1,ENSG00000227232,34,531,758,632,558,496
2,ENSG00000243485,0,1,2,0,0,1
3,ENSG00000237613,0,0,0,0,0,0
4,ENSG00000268020,0,0,0,0,0,0
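
Removing a single character only works when the version number has one digit. The check below is a small sketch of how leftover digits could be detected; the two IDs are made up for illustration, and the fixed-width fix assumes every ID is a plain `ENSG` ID (four letters plus 11 digits) with no suffix.

```python
import pandas as pd

# Made-up IDs: the second one had a two-digit version fused on (".10" -> "10"),
# so removing a single character leaves a stray digit behind.
ids = pd.Series(["ENSG000002239724", "ENSG0000022723210"])
stripped = ids.str[:-1]

# Ensembl human gene IDs are "ENSG" followed by 11 digits.
is_valid = stripped.str.fullmatch(r"ENSG\d{11}")
print(stripped[~is_valid].tolist())

# Assumption: unsuffixed ENSG IDs, so the first 15 characters are the stable ID.
fixed = ids.str[:15]
print(fixed.tolist())
```

If the first `print` lists any IDs for the real table, slicing to a fixed width (or stripping the version before the dots are deleted) is the safer choice.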


In [5]:
# save the cleaned gene count data (overwrites the file written by the shell pipeline)
df_gene_count.to_csv("data/gene_counts_raw.tsv", sep="\t", index=False)

# Load and save gene-mapped counts

In [2]:
import pandas as pd

df_gene = pd.read_csv("data/gene_mapped_counts_raw.csv")
print(df_gene.shape)
df_gene.head()

(51757, 8)


Unnamed: 0.1,Unnamed: 0,Geneid,X_filesTien_Cont1_S25sorted,X_filesTien_Cont2_S26sorted,X_filesTien_Cont3_S27sorted,X_filesTien_Zn1_S28sorted,X_filesTien_Zn2_S29sorted,X_filesTien_Zn3_S30sorted
0,2,TNMD,71,945,1295,815,729,496
1,3,DPM1,185,2383,3464,3081,2884,2011
2,4,SCYL3,29,576,706,623,557,351
3,6,FGR,2,22,43,37,27,22
4,8,FUCA2,142,1528,2270,1692,1542,1078


In [3]:
# drop the first column (a leftover row index saved by an earlier export)
df_gene = df_gene.drop(df_gene.columns[0], axis=1)
df_gene.head()

Unnamed: 0,Geneid,X_filesTien_Cont1_S25sorted,X_filesTien_Cont2_S26sorted,X_filesTien_Cont3_S27sorted,X_filesTien_Zn1_S28sorted,X_filesTien_Zn2_S29sorted,X_filesTien_Zn3_S30sorted
0,TNMD,71,945,1295,815,729,496
1,DPM1,185,2383,3464,3081,2884,2011
2,SCYL3,29,576,706,623,557,351
3,FGR,2,22,43,37,27,22
4,FUCA2,142,1528,2270,1692,1542,1078


In [4]:
# save the gene-mapped count data
df_gene.to_csv("data/gene_mapped_counts_raw.tsv", sep="\t", index=False)
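
The sample columns still carry residue of the BAM file paths (`_files`, `sorted`). The `X` prefix in the mapped table suggests it passed through R, which prepends `X` to column names that start with an underscore. A possible cleanup is sketched below; the helper `clean_sample_name` and its regular expression are hypothetical, and the count values are placeholders.

```python
import re

import pandas as pd


def clean_sample_name(name):
    """Strip the optional 'X' prefix, the '_files' residue, and the 'sorted' suffix."""
    return re.sub(r"^X?_files(.+?)sorted$", r"\1", name)


# Column names copied from the tables above; the count values are placeholders.
df = pd.DataFrame(
    [["TNMD", 0, 0]],
    columns=["Geneid", "X_filesTien_Cont1_S25sorted", "_filesTien_Zn1_S28sorted"],
)
df = df.rename(columns=clean_sample_name)
print(df.columns.tolist())
```

Names that do not match the pattern, such as `Geneid`, are returned unchanged.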