# Data exploration

## Protein-protein interaction networks (ppi_data)

### List all taxids in the data set

In [7]:
%%bash
cat <(tail -n +2  ../data/raw/ppi_data/hpidb2_March14_2017_mitab.txt | cut -f10) <(tail -n +2 ../data/raw/ppi_data/hpidb2_March14_2017_mitab.txt | cut -f11) | sed -E 's/taxid:([[:digit:]]+).*/\1/g' | sort -u

10029
10036
10090
10116
10141
10243
10245
10249
10254
10269
10273
10280
10298
10299
10304
10306
10308
10310
10313
10315
10320
10323
10324
10326
10329
10335
10345
10359
10360
10363
10366
10367
10370
10376
10377
10383
10384
10385
10389
10390
103904
103929
10407
10450
10497
10498
10515
10527
10529
10530
10533
10553
10559
10560
10579
10580
10583
10585
10586
10587
10588
10593
10595
10598
10600
10602
10614
10617
10621
10623
10632
10635
10636
10637
10641
10643
106820
10684
10710
10714
10884
10886
10900
10915
10970
11033
11034
11036
11041
11043
11044
11045
11049
11053
11059
11060
11065
11070
11072
11077
11078
11082
11084
11085
11096
11097
11098
11099
11100
11103
11104
11105
11108
11113
11116
1111708
11122
11142
11167
11171
11195
11198
11208
11212
11214
11215
11232
11233
11234
11235
11241
11250
11259
11273
11276
11284
11285
1140
11484
11553
11589
11590
11599
11602
11608
11619
11622
11624
11628
11641
11660
11665
11670
11673
11676
11678
11679
1168
11685
11686
11689
11691
11696
11698
11706
11707
1
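
The pipeline above reads the file twice, once per taxid column. As a minimal sketch, the same extraction can be done in a single pass. The function name `list_taxids` is hypothetical, and the sketch assumes the taxids sit in columns 10 and 11, as in the command above.

```bash
# list_taxids FILE (hypothetical helper): print the unique numeric taxids
# found in columns 10 and 11 of a tab-separated file, skipping the header.
list_taxids() {
    # Emit both taxid columns, one value per line, from every data row.
    awk -F'\t' 'NR > 1 { print $10; print $11 }' "$1" |
        # Keep only the digits that follow "taxid:" (drops the "(name)" part).
        sed -E 's/taxid:([[:digit:]]+).*/\1/' |
        sort -u
}
```

It would be called as `list_taxids ../data/raw/ppi_data/hpidb2_March14_2017_mitab.txt`. Like the original, it sorts lexically. If every output line is a plain number, `sort -nu` would give numeric order instead.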

### Number of columns in data set

In [15]:
%%bash
head -1 ../data/raw/ppi_data/hpidb2_March14_2017_mitab_plus.txt | awk -F'\t' '{print NF }'

26


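Do any lines have more fields than the header? The command below prints the largest field count in the file, so it rules out longer lines (but not shorter ones).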

In [13]:
%%bash
awk -F'\t' '{print NF}' ../data/raw/ppi_data/hpidb2_March14_2017_mitab_plus.txt | sort -nu | tail -n 1

26
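
The check above reports only the largest field count. A minimal sketch that tallies every distinct count would also reveal lines with fewer fields. The helper name `field_count_histogram` is hypothetical.

```bash
# field_count_histogram FILE (hypothetical helper): for each distinct number
# of tab-separated fields, print that number and how many lines have it.
field_count_histogram() {
    awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$1" |
        sort -n   # order the rows by field count
}
```

If every line has the same number of fields, the output is a single row. Any additional row points to malformed lines.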


List the column names. The leading `#` on the first name is part of the header line itself.

In [16]:
%%bash
awk -F'\t' ' { for (i = 1; i <= NF; ++i) print i, $i; exit } ' ../data/raw/ppi_data/hpidb2_March14_2017_mitab_plus.txt

1 # protein_xref_1
2 protein_xref_2
3 alternative_identifiers_1
4 alternative_identifiers_2
5 protein_alias_1
6 protein_alias_2
7 detection_method
8 author_name
9 pmid
10 protein_taxid_1
11 protein_taxid_2
12 interaction_type
13 source_database_id
14 database_identifier
15 confidence
16 protein_xref_1_unique
17 protein_xref_2_unique
18 protein_taxid_1_cat
19 protein_taxid_2_cat
20 protein_taxid_1_name
21 protein_taxid_2_name
22 protein_seq1
23 protein_seq2
24 source_database
25 protein_xref_1_display_id
26 protein_xref_2_display_id


### Retrieve data sources

In [1]:
%%bash
tail -n +2 ../data/raw/ppi_data/hpidb2_March14_2017_mitab_plus.txt | cut -f1 | sed -r 's/(^.*):.*/\1/g' | sort -u

ensembl
entrez gene/locuslink
intact
uniprotkb


### Number of Entrez genes

In [2]:
%%bash
cat <(cut -f1 ../data/raw/ppi_data/hpidb2_March14_2017_mitab_plus.txt) <(cut -f2 ../data/raw/ppi_data/hpidb2_March14_2017_mitab_plus.txt) | grep entrez | sort -u | sed -rn 's/^.*:(.*)/\1/p' | wc -l

3044
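
The pipeline above removes duplicates before it strips the database prefix. As a sketch, the variant below strips the prefix first and deduplicates the bare identifiers. It assumes Entrez entries are written as `entrez gene/locuslink:<id>` in the first two columns. The function name `count_entrez_ids` is hypothetical.

```bash
# count_entrez_ids FILE (hypothetical helper): count distinct Entrez gene IDs
# appearing in columns 1 and 2 of a tab-separated file with a header line.
count_entrez_ids() {
    awk -F'\t' 'NR > 1 { print $1; print $2 }' "$1" |
        # Assumed prefix, taken from the data sources listed earlier.
        grep '^entrez gene/locuslink:' |
        sed 's/^[^:]*://' |          # drop everything up to the first colon
        sort -u |
        awk 'END { print NR }'       # count lines without wc's padding
}
```

When each field holds exactly one prefixed identifier, this should agree with the count above.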


## Check overlap between data sets
`comm` prints three tab-indented columns: identifiers found only in PHISTO (no indent), only in HPIDB2 (one tab), and in both data sets (two tabs).

In [2]:
%%bash
comm <(cut -f3 -d, ../data/raw/ppi_data/phisto_Jan19_2017.csv | sed 's/"//g' | sort -u) <(cut -f2 ../data/raw/ppi_data/hpidb2_March14_2017_mitab.txt | sed 's/uniprotkb://g' | sort -u)

	A0A024A2C9
	A0A088QCN4
	A0A088QCP6
	A0A088QCP8
	A0A088QCQ2
	A0A088QCQ8
	A0A088QCS4
	A0A088QCS8
	A0A088QCT3
	A0A088QCT7
	A0A088QCU2
	A0A088QCU5
	A0A088QD14
	A0A088QD18
	A0A088QD22
	A0A088QD27
	A0A088QD33
	A0A088QD42
	A0A088QD46
	A0A088QD54
	A0A088QD58
	A0A088QD59
	A0A088QD63
	A0A088QD64
	A0A088QD73
	A0A088QD79
	A0A088QD82
	A0A088QD87
	A0A088QD91
	A0A088QD96
	A0A088QF47
	A0A088QF49
	A0A088QF61
	A0A088QF67
	A0A088QF78
	A0A088QF82
	A0A088QF87
	A0A088QF89
	A0A088QF94
	A0A088QF98
	A0A088QTU0
	A0A088QTU4
	A0A088QTV3
	A0A088QTW6
	A0A088QTX5
	A0A088QTX9
	A0A088QTY6
	A0A089NDG7
	A0A0C7TPJ3
	A0A0F6B063
	A0A0F6B1Q8
	A0A0F7R416
	A0A0F7R9M2
	A0A0F7R9R7
	A0A0F7RBN8
	A0A0F7RDC0
	A0A0F7REZ1
	A0A0F7RG28
	A0A0F7RG58
	A0A0F7RI40
	A0A0F7RKT3
	A0A0H2VBI7
	A0A0H2X3D0
	A0A0H2ZTM2
	A0A0H3A1L5
	A0A0H3A2T3
	A0A0H3JFN8
	A0A0H3JGR6
	A0A0H3LYM0
	A0A0H3NA16
	A0A0H3NB75
	A0A0H3NDL6
	A0A0H3NF38
	A0A0H3NG92
	A0A0H3NGI8
	A0A0H3NGY5
	A0A0H3NMJ6
	A0A142I9X8
	A0A173DS53
	A0A1A9IFF4
	A0A1A9IJH2
	A0MPS7
	A1S3N8
		A1Z0Q5
	A2
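
The full three-column listing is long, so a summary of the column sizes may be easier to read. A minimal sketch, assuming the identifiers from each data set have already been written to two plain files with one identifier per line. The function name `overlap_counts` and the output labels are hypothetical.

```bash
# overlap_counts A B (hypothetical helper): given two files of identifiers,
# print how many are only in A, only in B, and shared by both.
overlap_counts() {
    a=$(mktemp)
    b=$(mktemp)
    # comm requires sorted input; -u also removes duplicate identifiers.
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    # comm -23 keeps column 1 only, -13 column 2 only, -12 column 3 only.
    printf 'only_first\t%s\n'  "$(comm -23 "$a" "$b" | awk 'END { print NR }')"
    printf 'only_second\t%s\n' "$(comm -13 "$a" "$b" | awk 'END { print NR }')"
    printf 'shared\t%s\n'      "$(comm -12 "$a" "$b" | awk 'END { print NR }')"
    rm -f "$a" "$b"
}
```

The two input files could be produced with the same `cut` and `sed` steps used in the cell above, redirected to files instead of being passed to `comm` directly.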