Python module allowing you to do various searches for links on the Web.
pip3 install websearch-python
OR you can install dev version
pip3 install https://github.com/iTeam-S/WebSearch/archive/refs/heads/main.zip
from websearch import WebSearch as web
for page in web('iTeam-$').pages[:2]:
print(page)
[RESULTS]
https://iteam-s.mg/
https://github.com/iTeam-S
# run webserver
websearch --host 0.0.0.0 --port 7845
OR
# run webserver
python -m websearch --host 0.0.0.0 --port 7845
# requests contents
curl http://0.0.0.0:7845/pages/botoravony+arleme
[
"https://portfolio.iteam-s.mg/?id=2",
"https://portfolio.iteam-s.mg/libs/cv/arleme.pdf",
"https://madagascar.webcup.fr/team-webcup/iteams"
]
curl https://websearch-python.herokuapp.com/pages/botoravony+arleme
FULL DOCUMENTATION
from websearch import WebSearch
web = WebSearch('Gaetan Jonathan BAKARY')
You can pass a list
for mutliple keyword.
web = WebSearch(['Gaetan Jonathan BAKARY', 'iTeam-S'])
You can also specify a website
as a reference.
web = WebSearch('Gaetan Jonathan', site='iteam-s.mg')
from websearch import WebSearch
web = WebSearch('Gaetan Jonathan BAKARY')
webpages = web.pages
for wp in webpages[:5]:
print(wp)
[RESULTS]
https://mg.linkedin.com/in/gaetanj
https://portfolio.iteam-s.mg/?u=gaetan
https://github.com/gaetan1903
https://medium.com/@gaetan1903
https://gitlab.com/gaetan1903
from websearch import WebSearch
web = WebSearch('Gaetan Jonathan BAKARY')
webimages = web.images
for im in webimages[:5]:
print(im)
[RESULTS]
https://tse3.mm.bing.net/th?id=OIP.-K25y8TqkOi9UG_40Ti8bgAAAA
https://tse1.mm.bing.net/th?id=OIP.yJPVcDx6znFSOewLdQBbHgHaJA
https://tse3.mm.bing.net/th?id=OIP.7rO2T_nDAS0bXm4tQ4LKQAHaJA
https://tse2.mm.bing.net/th?id=OIP.IUIEkGQVzYRKaDA7WeeV7QHaEF
https://tse3.explicit.bing.net/th?id=OIP.OmvVnMIVu2ZdNZHZzJK_hgAAAA
from websearch import WebSearch
web = WebSearch('Math 220')
pdfs = web.pdf
for pdf in pdfs[:5]:
print(pdf)
[RESULTS]
https://www.coconino.edu/resources/files/pdfs/registration/curriculum/course-outlines/m/mat/mat_220.pdf
https://www.jmu.edu/mathstat/Files/ALEKSmatrix.pdf
https://www.jjc.edu/sites/default/files/Academics/Math/M220%20Master%20Syllabus%20SP18.pdf
https://www.sonoma.edu/sites/www/files/2018-19cat-11math.pdf
https://www.svsd.net/cms/lib5/PA01001234/Centricity/Domain/1009/3.3-3.3B-Practice-KEY.pdf
To prevent the search for attachments with format verification, set verif=False
, which is True
by default.
Format verification is presented here
from websearch import WebSearch
web = WebSearch('Math 220', verif=False)
from websearch import WebSearch:
web = WebSearch('python')
words = web.docx
for word in words[:3]:
print(word)
[RESULTS]
https://www.ocr.org.uk/Images/572953-j277-programming-techniques-python.docx
https://www.niu.edu/brown/_pdf/physics374_spring2021/l1-19-21.docx
https://ent2d.ac-bordeaux.fr/disciplines/mathematiques/wp-content/uploads/sites/3/2017/09/de-Scratch-%C3%A0-Python.docx
from websearch import WebSearch:
web = WebSearch('datalist')
excels = web.xlsx
for excel in excels[:3]:
print(excel)
[RESULTS]
https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/979255/Detailed_Single_Data_List_-_2021-2022.xlsx
https://www.jaist.ac.jp/top/data/list-achievement-research-e.xlsx
https://img1.wsimg.com/blobby/go/bed8f8d7-d6c2-488d-9aa3-5910e18aa8d2/downloads/Datalist.xlsx
from websearch import WebSearch:
web = WebSearch('Leadership')
powerpoints = web.pptx
for powerpoint in powerpoints[:3]:
print(powerpoint)
[RESULTS]
https://www.plainviewisd.org/cms/lib6/TX01918200/Centricity/Domain/853/Leadership%20Behav.%20Styles.pptx
https://www.yorksandhumberdeanery.nhs.uk/sites/default/files/leadership_activity_and_msf.pptx
https://www.itfglobal.org/sites/default/files/node/resources/files/Stage%203.1%20Powerpoint.pptx
from websearch import WebSearch
web = WebSearch('Finance')
documents = web.odt
for doc in documents[:2]:
print(doc)
[RESULTS]
https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/970748/Green_Finance_Report.odt
https://iati.fcdo.gov.uk/iati_documents/3678707.odt
from websearch import WebSearch
web = WebSearch('Commerce')
documents = web.ods
for doc in documents[:2]:
print(doc)
[RESULTS]
http://www.justice.gouv.fr/art_pix/Stat_RSJ_12.7_Civil_Les_tribunaux_de_commerce.ods
https://www.insee.fr/fr/metadonnees/source/fichier/Precision-principaux-indicateurs-crise-sanitaire-2020.ods
from websearch import WebSearch
web = WebSearch('Renaissance')
documents = web.odp
for doc in documents[:2]:
print(doc)
[RESULTS]
http://ekladata.com/9sHTcbLYfwbNGKU9cpnZXjlsbfA/17-Art-Renaissance.odp
https://www.college-yvescoppens-malestroit.ac-rennes.fr/sites/college-yvescoppens-malestroit.ac-rennes.fr/IMG/odp/diapo-presentation-voyage-5e.odp
from websearch import WebSearch
web = WebSearch('Madagascar')
maps = web.kml
for map in maps[:3]:
print(map)
[RESULTS]
http://www.hydrosciences.fr/sierem/kmz_files/MGPLGRA.kml
https://www.ngoaidmap.org/downloads?doc=kml&name=association-intercooperation-madagascar-aim_projects&partners%5B%5D=6160§ors%5B%5D=1&status=active
https://ngoaidmap.org/downloads?doc=kml&name=nemp-madagascar-cyclone-enawo-response_projects&projects%5B%5D=20655&status=active
For other extensions, not present, use the custom
function
Second arg can be taken here
from websearch import WebSearch
web = WebSearch('Biologie')
ps_documents = web.custom('ps', 'application/postscript')
for doc in ps_documents[:3]:
print(doc)
[RESULTS]
http://irma.math.unistra.fr/~fbertran/Master1_2020_2/L3Court.ps
http://jfla.inria.fr/2002/actes/10-michel.ps
https://www.crstra.dz/telechargement/pnr/ps/environnement/fadel-djamel.ps
you can deploy as webserver and send an http request
python -m websearch --host [host] --port [port]
[*] default host : 0.0.0.0
[*] default port : 7845
Exemple for page:
curl http://<host>:<port>/pages/botoravony+arleme
[
"https://portfolio.iteam-s.mg/?id=2",
"https://portfolio.iteam-s.mg/libs/cv/arleme.pdf",
"https://madagascar.webcup.fr/team-webcup/iteams"
]
Exemple for image:
curl http://<host>:<port>/images/one+piece
[
"https://tse1.mm.bing.net/th?id=OIP.GlNk7idD3RCI_SYLiVzSBAHaE7",
"https://tse2.mm.bing.net/th?id=OIP.uePUN5rwpB-7wicu1uxQcgHaFj",
"https://tse2.mm.bing.net/th?id=OIP.dwWBU-A_6KPvvEYsL2nhVgHaFc",
"https://tse1.mm.bing.net/th?id=OIP.5M8tKIhIWvbqGO1prhUGfAHaJ4",
.....
"https://tse4.mm.bing.net/th?id=OIP.uvp3efwHRLDJnUWZ5KLWCwHaE8",
"https://tse3.mm.bing.net/th?id=OIP.d_uUoc-8R13RZ1bb76yhZgHaKp",
"https://tse1.mm.bing.net/th?id=OIP.cBWDvspBM036p6h4DS6RTAHaFj"
}
Search by extension : curl http://<host>:<port>/<extension>/<query>
Where extension is from this list:
swf, pdf, ps, dwf, kml, kmz, gpx, hwp, htm, html, xls, xlsx,
ppt, pptx, doc, docx, odp, ods, odt, rtf, svg, tex, txt, text,
bas, c, cc, cpp, cxx, h, hpp, cs, java, pl, py, wml, wap, xml
Exemple :
curl http://<host>:<port>/kml/madagascar+antananarivo
[
"https://ifl.francophonelibre.org/atelier/ActionOSMMG2019/wms/kml?layers=ActionOSMMG2019:MG_Antananarivo_pharmacy_point_OSM_20190427"
]
You can use the parameter limit
to limit results
curl http://<host>:<port>/images/one+piece?limit=4
[
"https://tse1.mm.bing.net/th?id=OIP.GlNk7idD3RCI_SYLiVzSBAHaE7",
"https://tse2.mm.bing.net/th?id=OIP.uePUN5rwpB-7wicu1uxQcgHaFj",
"https://tse2.mm.bing.net/th?id=OIP.dwWBU-A_6KPvvEYsL2nhVgHaFc",
"https://tse1.mm.bing.net/th?id=OIP.5M8tKIhIWvbqGO1prhUGfAHaJ4"
]
curl http://<host>:<port>/pdf/statut?verif=false&site=iteam-s.mg
Give a star 🌟 if this project helped you!
MIT License
Copyright (c) 2021 iTeam-$