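Run the following command in your terminal to download the approved HGNC gene symbols, together with their previous symbols and aliases, as a tab-separated file: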

curl -o ../hgnc_genes.tsv 'https://www.genenames.org/cgi-bin/download/custom?col=gd_app_sym&col=gd_prev_sym&col=gd_aliases&status=Approved&hgnc_dbtag=on&order_by=gd_app_sym_sort&format=text&submit=submit'
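
The download is expected to contain three tab-separated columns, with multiple previous symbols or aliases joined by ", " inside a cell. The following is a minimal sketch of that layout, not the real file: the header names (`Approved symbol`, `Previous symbols`, `Alias symbols`) are an assumption, the sample rows are made up, and `split_symbols` and `read_rows` are hypothetical helpers used only for illustration.

```python
import csv
import io

# Hypothetical sample in the assumed layout of hgnc_genes.tsv.
# The rows are illustrative and not copied from the real download.
SAMPLE_TSV = (
    "Approved symbol\tPrevious symbols\tAlias symbols\n"
    "GENE1\tOLD1, OLD1B\tALIAS1\n"
    "GENE2\t\tALIAS2A, ALIAS2B\n"
)


def split_symbols(cell):
    """Split a comma-separated cell into a list of stripped symbols."""
    return [s.strip() for s in cell.split(",") if s.strip()]


def read_rows(handle):
    """Yield (approved, previous, aliases) tuples from a TSV file handle."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    for cells in reader:
        cells += [""] * (3 - len(cells))  # pad rows that have fewer than 3 cells
        if cells[0]:
            yield cells[0], split_symbols(cells[1]), split_symbols(cells[2])


rows = list(read_rows(io.StringIO(SAMPLE_TSV)))
for row in rows:
    print(row)
```

To inspect the real download in the same way, pass an open file handle for `../hgnc_genes.tsv` instead of the `io.StringIO` sample.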

In [5]:
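from collections import defaultdict
import csv


class GeneValidator:
    """Validate gene symbols against an HGNC export of approved, previous, and alias symbols."""

    def __init__(self, file_path):
        self.gene_symbol_set = set()       # approved symbols, upper-cased
        self.alias_map = defaultdict(str)  # previous/alias symbol -> approved symbol

        with open(file_path, 'r', newline='', encoding='utf-8') as file:
            reader = csv.reader(file, delimiter='\t')
            next(reader)  # Skip header row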
            for cells in reader:
                if cells[0]:
                    gene = cells[0].upper()
                    self.gene_symbol_set.add(gene)

                    if len(cells) > 1 and cells[1]:
                        previous_symbols = cells[1].split(", ")
                        for symbol in previous_symbols:
                            self.alias_map[symbol.upper()] = gene

                    if len(cells) > 2 and cells[2]:
                        alias_symbols = cells[2].split(", ")
                        for alias in alias_symbols:
                            self.alias_map[alias.upper()] = gene

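    def validate_human_genes(self, genes):
        """Classify each term as an approved symbol, a previous/alias symbol, or unrecognized.

        Returns a dict with 'official_genes' (set), 'invalid' (set), and
        'updated_genes' (input term -> approved symbol).
        """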
        official_genes = set()
        invalid_genes = set()
        updated_genes = {}

        for raw_term in genes:
            # print(f'validate Hugo symbol for {raw_term}')
            term = raw_term.upper()
            if term in self.gene_symbol_set:
                official_genes.add(term)
            elif term in self.alias_map:
                official_gene = self.alias_map[term]
                official_genes.add(official_gene)
                updated_genes[term] = official_gene
                # The input term itself is not an approved symbol, so it is also reported as invalid
                invalid_genes.add(term)
            else:
                invalid_genes.add(term)

        return {
            'official_genes': official_genes,
            'invalid': invalid_genes,
            'updated_genes': updated_genes
        }
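
The three-way classification can be illustrated without the HGNC file. The following is a minimal sketch that assumes the same rules as `validate_human_genes`; `classify_terms` is a hypothetical stand-alone function and the symbols are toy data, not real gene names. Note that a term resolved through an alias appears in both `updated_genes` and `invalid`.

```python
def classify_terms(terms, approved, alias_map):
    """Sort terms into official, updated (alias -> approved), and invalid."""
    official, invalid, updated = set(), set(), {}
    for raw in terms:
        term = raw.upper()
        if term in approved:
            official.add(term)
        elif term in alias_map:
            official.add(alias_map[term])
            updated[term] = alias_map[term]
            invalid.add(term)  # the input spelling itself is not approved
        else:
            invalid.add(term)
    return {
        "official_genes": official,
        "invalid": invalid,
        "updated_genes": updated,
    }


# Hypothetical toy data, not taken from the HGNC file.
approved = {"GENE1", "GENE2"}
alias_map = {"OLD1": "GENE1"}

result = classify_terms(["gene2", "old1", "notagene"], approved, alias_map)
print(sorted(result["official_genes"]))  # ['GENE1', 'GENE2']
print(sorted(result["invalid"]))         # ['NOTAGENE', 'OLD1']
print(result["updated_genes"])           # {'OLD1': 'GENE1'}
```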


In [7]:
# Usage example:
file_path = "../hgnc_genes.tsv"  # same path as the -o argument of the curl command

text = '''
	The significant upregulation of OAS family proteins, including OAS1 (2.309110681 at 24h, 2.327680098 at 48h), OAS2 (3.580865407 at 48h), and OAS3 (3.175544218 at 48h), along with OASL (2.916732074 at 48h), suggests a coordinated activation of the OAS-RNase L pathway in response to Dengue virus infection. This pathway is known to play a crucial role in the innate immune response against viral infections. The 2'-5'-oligoadenylate synthetases (OAS) are interferon-induced enzymes that, when activated by viral double-stranded RNA, produce 2'-5'-oligoadenylates. These oligoadenylates activate RNase L, which degrades viral and cellular RNAs, inhibiting protein synthesis and viral replication. The observed upregulation of multiple OAS family members indicates a robust activation of this antiviral mechanism. Interestingly, the data also shows a significant increase in IFIT family proteins, particularly IFIT1 (3.970117068 at 24h, 6.283329563 at 48h), IFIT2 (3.884688719 at 24h, 6.643829723 at 48h), and IFIT3 (3.494644087 at 24h, 5.69825699 at 48h). These proteins are known to recognize and sequester viral RNA, preventing its translation and replication. We hypothesize that the OAS-RNase L pathway and the IFIT family proteins work synergistically to create a multi-layered defense against Dengue virus. The OAS proteins may be primarily responsible for degrading viral RNA, while the IFIT proteins sequester any remaining viral RNA and prevent its translation. This coordinated response could explain the cell's ability to mount a strong antiviral state. To validate this hypothesis, siRNA knockdown experiments targeting OAS1, OAS2, OAS3, and IFIT1-3 could be performed in Dengue virus-infected cells. The effect on viral replication and cellular RNA degradation could be assessed through qRT-PCR and RNA sequencing. Additionally, overexpression of these proteins in susceptible cell lines could provide insights into their protective effects against Dengue virus infection. 
This hypothesis not only explains the observed upregulation of multiple antiviral proteins but also suggests potential targets for enhancing the cellular defense against Dengue virus.
'''

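import re

# Tokenize on runs of letters, digits, and hyphens so that attached
# punctuation (e.g. "OAS1,") does not prevent a symbol from matching.
candidate_terms = set(re.findall(r"[A-Za-z0-9-]+", text))
validator = GeneValidator(file_path)
result = validator.validate_human_genes(candidate_terms)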
print(result)

updated_gene_symbols = list(result['official_genes'])
invalid_gene_symbols = list(result['invalid'])
updated_genes_mapping = result['updated_genes']

print(updated_genes_mapping)

defaultdict(<class 'str'>, {'NCRNA00181': 'A1BG-AS1', 'A1BGAS': 'A1BG-AS1', 'A1BG-AS': 'A1BG-AS1', 'FLJ23569': 'A1BG-AS1', 'ACF': 'A1CF', 'ASP': 'TMPRSS11D', 'ACF64': 'A1CF', 'ACF65': 'A1CF', 'APOBEC1CF': 'A1CF', 'FWP007': 'A2M', 'S863-7': 'A2M', 'CPAMD5': 'A2M', 'CPAMD9': 'A2ML1', 'FLJ25179': 'A2ML1', 'P170': 'A2ML1', 'A2MP': 'A2MP1', 'A3GALT2P': 'A3GALT2', 'IGBS3S': 'A3GALT2', 'IGB3S': 'A3GALT2', 'P1': 'RPLP1', 'A14GALT': 'A4GALT', 'GB3S': 'A4GALT', 'P(K)': 'A4GALT', 'ALPHA4GNT': 'A4GNT', 'FLJ12389': 'AACS', 'SUR-5': 'AACS', 'ACSF1': 'AACS', 'AACSL': 'AACSP1', 'DAC': 'AADAC', 'CES5A1': 'AADAC', 'MGC72001': 'AADACL2', 'OTTHUMG00000001887': 'AADACL3', 'OTTHUMG00000001889': 'AADACL4', 'KATII': 'AADAT', 'KAT2': 'AADAT', 'KYAT2': 'AADAT', 'FLJ11506': 'AAGAB', 'P34': 'GTF2H3', 'KIAA1048': 'AAK1', 'DKFZP686K16132': 'AAK1', 'C11ORF67': 'AAMDC', 'PTD015': 'AAMDC', 'FLJ21035': 'AAMDC', 'CK067': 'AAMDC', 'SNAT': 'AANAT', 'C20ORF4': 'AAR2', 'BA234K24.2': 'AAR2', 'C8ORF85': 'AARD', 'LOC441376': '