# English words in the human proteome

The goal of this first project is to find out whether English words occur in the sequences of the human proteome, that is, in the sequences of all human proteins.

## Amino acid composition

First, compose 5 English words using the letters of the 20 amino acids.
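As a quick way to check candidate words, the sketch below tests whether a word uses only the one-letter codes of the 20 standard amino acids. The `AMINO_ACIDS` constant and the `is_protein_word()` helper are illustrative names, not part of the exercise.

```python
# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_protein_word(word):
    """Return True if every letter of `word` is an amino acid code."""
    return set(word.upper()) <= AMINO_ACIDS


# A few candidate words, chosen for illustration.
for candidate in ["access", "acid", "well", "job"]:
    print(candidate, is_protein_word(candidate))
```

Words containing B, J, O, U, X, or Z are rejected, since none of these letters is among the 20 codes used here.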

## Words

Download the file [english-common-words.txt](https://python.sdv.univ-paris-diderot.fr/data-files/english-common-words.txt). It contains the 3000 most frequent English words, one word per line.

In this notebook, write the function `read_words()`, which reads the words from the file whose name is given as an argument of the script and returns a list of the words converted to uppercase and made up of 3 characters or more.

In the main program, display the number of selected words.



In [5]:
!wget https://python.sdv.univ-paris-diderot.fr/data-files/english-common-words.txt

--2022-01-21 14:42:20--  https://python.sdv.univ-paris-diderot.fr/data-files/english-common-words.txt
Resolving python.sdv.univ-paris-diderot.fr (python.sdv.univ-paris-diderot.fr)... 2001:660:3301:8003::54, 194.254.61.54
Connecting to python.sdv.univ-paris-diderot.fr (python.sdv.univ-paris-diderot.fr)|2001:660:3301:8003::54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22224 (22K) [text/plain]
Saving to: ‘english-common-words.txt’


2022-01-21 14:42:21 (6.17 MB/s) - ‘english-common-words.txt’ saved [22224/22224]



In [64]:
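def read_words(file_path):
    """Return the words of 3 characters or more, in uppercase."""
    with open(file_path, 'r') as file:
        # Strip each line first so the newline is not counted in the length.
        words = [line.strip().upper() for line in file]
    return [word for word in words if len(word) >= 3]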

In [65]:
file_path = 'english-common-words.txt'

words = read_words(file_path)
print(f"{len(words)} words selected")


## Proteins

Now download the file [human-proteome.fasta](https://python.sdv.univ-paris-diderot.fr/data-files/human-proteome.fasta). Note that this file is fairly large (about 13 MB). It comes from the UniProt database.

*NB: the code in the cell below downloads the file directly into the same folder as this notebook, provided you are on GNU/Linux or macOS and wget is installed.*
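Where wget is not available, a portable alternative is to download the file from Python itself. The following is a minimal sketch using the standard library's `urllib.request.urlretrieve`; the `download_if_missing()` helper is a hypothetical name introduced here. Skipping the download when the file is already present also avoids creating a duplicate copy such as `human-proteome.fasta.1`.

```python
import os
import urllib.request


def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False otherwise.
    """
    if os.path.exists(dest):
        return False
    urllib.request.urlretrieve(url, dest)
    return True


# Example call (commented out so that nothing is downloaded by accident):
# download_if_missing(
#     "https://python.sdv.univ-paris-diderot.fr/data-files/human-proteome.fasta",
#     "human-proteome.fasta",
# )
```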

In [66]:
!wget https://python.sdv.univ-paris-diderot.fr/data-files/human-proteome.fasta

--2022-01-21 15:16:57--  https://python.sdv.univ-paris-diderot.fr/data-files/human-proteome.fasta
Resolving python.sdv.univ-paris-diderot.fr (python.sdv.univ-paris-diderot.fr)... 2001:660:3301:8003::54, 194.254.61.54
Connecting to python.sdv.univ-paris-diderot.fr (python.sdv.univ-paris-diderot.fr)|2001:660:3301:8003::54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13615791 (13M)
Saving to: ‘human-proteome.fasta.1’


2022-01-21 15:17:05 (1.90 MB/s) - ‘human-proteome.fasta.1’ saved [13615791/13615791]



Still in this notebook, write the function `read_sequences()`, which reads the proteome from the file whose name is given as the second argument of the script. This function returns a dictionary whose keys are the protein identifiers (for example, O95139, O75438, Q8N4C6) and whose values are the corresponding sequences.

In the main program, display the number of sequences read. For testing purposes, also display the sequence associated with protein O95139.

In [40]:
def read_sequences(file_path):
    """Return a dictionary mapping protein identifiers to their sequences."""
    sequences = {}
    identifier = None
    with open(file_path) as file:
        for line in file:
            if line.startswith('>sp|'):
                # Header line: the identifier follows '>sp|'
                # (assumed here to be 6 characters long, e.g. O95139).
                identifier = line[4:10]
                sequences[identifier] = ''
            elif identifier is not None:
                # Sequence line: append it to the current protein.
                sequences[identifier] += line.strip()
    return sequences

In [42]:
proteome_path = "human-proteome.fasta"
sequences = read_sequences(proteome_path)

In [43]:
len(sequences)

20143

In [47]:
sequences.get('O95139')

'MTGYTPDEKLRLQQLRELRRRWLKDQELSPREPVLPPQKMGPMEKFWNKFLENKSPWRKMVHGVYKKSIFVFTHVLVPVWIIHYYMKYHVSEKPYGIVEKKSRIFPGDTILETGEVIPPMKEFPDQHH'
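The identifier sits between the first two `|` characters of each header line. The sketch below shows how splitting on `|` extracts it regardless of the identifier's length. It uses a made-up header: the accession `P12345` and the entry name are placeholders, not taken from the file, and `parse_identifier()` is a hypothetical helper.

```python
def parse_identifier(header):
    """Extract the identifier from a header like '>sp|ID|NAME description'."""
    fields = header.split("|")
    if not header.startswith(">") or len(fields) < 3:
        raise ValueError(f"Unexpected FASTA header: {header!r}")
    return fields[1]


# Placeholder header, for illustration only.
print(parse_identifier(">sp|P12345|EXAMPLE_HUMAN Hypothetical protein"))
```

Compared with fixed-position slicing, this does not depend on every identifier having the same number of characters.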

## Fishing for words

Now write the function `search_words_in_proteome()`, which takes as arguments the list of words and the dictionary of protein sequences, and counts the number of sequences in which each word is present. This function returns a dictionary whose keys are the words and whose values are the number of sequences containing them. The function also displays the following message for each word found in the proteome:

```
ACCESS found in 1 sequences
ACID found in 38 sequences
ACT found in 805 sequences
[...]
```
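Note that the function counts sequences, not occurrences: a word repeated several times in one protein still counts once. The toy data below (made-up sequences, not taken from the proteome) illustrates the difference between the two counts.

```python
# Toy sequences, invented for illustration.
toy_sequences = {
    "P1": "MACTACTK",  # contains ACT twice
    "P2": "MKACTLL",   # contains ACT once
    "P3": "MKLLVV",    # does not contain ACT
}
word = "ACT"

# Number of sequences containing the word (what the exercise asks for).
n_sequences = sum(1 for seq in toy_sequences.values() if word in seq)

# Total number of non-overlapping occurrences across all sequences.
n_occurrences = sum(seq.count(word) for seq in toy_sequences.values())

print(f"{word} found in {n_sequences} sequences")
print(f"{word} occurs {n_occurrences} times in total")
```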


In [68]:
def search_words_in_proteome(words, sequences):
    """Count, for each word, the number of sequences that contain it."""
    result_dict = {}

    for word in words:
        # The `in` operator tests whether the word is a substring of the sequence.
        count = sum(1 for seq in sequences.values() if word in seq)
        if count > 0:
            # Only words found at least once are stored and reported.
            result_dict[word] = count
            print(f"{word} found in {count} sequences")

    return result_dict

In [69]:
rd = search_words_in_proteome(words, sequences)


### And the most frequent word is...

Finally, write the function `find_most_frequent_word()`, which takes as argument the dictionary returned by the previous function `search_words_in_proteome()` and displays the word found in the most proteins, along with the number of sequences in which it was found, in the form:

`=> xxx found in yyy sequences`

In [72]:
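def find_most_frequent_word(rd):
    """Print the word found in the largest number of sequences."""
    # max() with key=rd.get returns the word with the highest count, without
    # building an inverted dictionary (which drops words sharing a count).
    word = max(rd, key=rd.get)
    print(f"=> {word} found in {rd[word]} sequences")

In [73]:
find_most_frequent_word(rd)

=> ALL found in 6477 sequences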
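To look beyond the single best word, one option is `collections.Counter`, whose `most_common()` method returns entries sorted by decreasing count. The sketch below uses a small hand-written dictionary holding only the four counts quoted in this notebook (ACCESS, ACID, ACT, and ALL); `top_words()` is a hypothetical helper.

```python
from collections import Counter


def top_words(rd, n=3):
    """Return the n (word, count) pairs with the highest counts."""
    return Counter(rd).most_common(n)


# Subset of results, limited to the counts quoted in this notebook.
counts = {"ACCESS": 1, "ACID": 38, "ACT": 805, "ALL": 6477}
for word, num in top_words(counts):
    print(f"=> {word} found in {num} sequences")
```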